Examine and tweak a given analyzer? - python

I'm using the French analyzer.
Having examined the output from IndexClient.analyze(...) for this analyzer, I'm a little unhappy with some of the stopwords (e.g. the expression 'ayant-cause' comes out as 'caus', because 'ayant' is on the French stopwords list).
How do I go about examining these stopwords and then tweaking them? Do I have to create a custom analyzer based on the existing French one? Or can I directly tweak the French one?
NB I am using the Python elasticsearch module ("thin client"), but an answer in terms of REST commands would be fine.

Yes, you can tweak an existing analyzer and examine its output using the Analyze API of Elasticsearch.
Ultimately an analyzer is made of three things: character filters, a tokenizer, and token filters. You can create your own combination of these to build a custom analyzer and test it using the REST API.
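For example, you can feed sample text through the stock french analyzer and inspect the tokens it produces. A minimal sketch with the Python thin client (host and port are assumptions):
from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')
# Run sample text through the built-in French analyzer and inspect the tokens
resp = es.indices.analyze(body={'analyzer': 'french', 'text': 'ayant-cause'})
print([t['token'] for t in resp['tokens']])  # e.g. ['caus']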

I spent quite a bit of time figuring out at least a workaround arrangement.
Having downloaded the French stop-words file from GitHub, I edited it (e.g. to remove "ayant"). It currently resides in the "config" directory of my installed ES setup (although you can set an absolute path).
Then I made my settings/mappings object like this:
{
    'settings': {
        'analysis': {
            'analyzer': {
                'tweaked_french': {
                    'type': 'french',
                    # NB W10, config path currently D:\apps\ElasticSearch\elasticsearch-7.10.2\config
                    'stopwords_path': 'tweaked_french_stop.txt',
                },
            },
        },
    },
    'mappings': {
        'dynamic': 'strict',
        'properties': {
            'my_french_field': {
                'type': 'text',
                'term_vector': 'with_positions_offsets',
                'fields': {
                    'french': {
                        'type': 'text',
                        'analyzer': 'tweaked_french',
                        'term_vector': 'with_positions_offsets',
                    },
                },
            },
        },
    },
}
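For completeness, a sketch of how that body might be applied when creating the index with the thin client (the index name and the body variable are assumptions):
from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')
# 'body' is the settings/mappings dict shown above; 'my_french_index' is made up
es.indices.create(index='my_french_index', body=body)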
What is then rather wonderful is that, according to my experiments, a query can find and use that custom-built analyser (i.e. it's there and available in the installed index). So your query object is relatively simple:
{
    'query': {
        'simple_query_string': {
            'query': query_text,
            'fields': [
                'my_french_field.french',
            ],
            'analyzer': 'tweaked_french',
        },
    },
    'highlight': {
        'fields': {
            'my_french_field.french': {
                'type': 'fvh',
                ...
            },
        },
        'number_of_fragments': 0,
    },
}
After that you can query in French: your query gets stemmed and the result is used for the search. If "ayant" is a word in your query string, it will now return hits including "ayant-cause", proving that both the query and the mapping spec are using the tweaked stop-word list.
I'd still like to know whether a way exists that doesn't involve an external file, i.e. a way to edit on the fly what is already there (or just to see what is already there...).
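For what it's worth, the built-in language analyzers also accept an inline stopwords parameter (a predefined name like _french_, or an explicit list), so one file-free option would be to pass the edited list directly in the settings. A hedged sketch (the list here is truncated for illustration):
settings = {
    'settings': {
        'analysis': {
            'analyzer': {
                'tweaked_french': {
                    'type': 'french',
                    # An inline list replaces the default '_french_' set entirely,
                    # so it must contain every stopword you still want (truncated here)
                    'stopwords': ['au', 'aux', 'avec', 'ce', 'ces'],
                },
            },
        },
    },
}
As for just seeing the defaults, the stock lists appear to ship with Lucene (the French one being Lucene's french_stop.txt), so the Lucene source/docs seem to be the place to look.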

Related

Using Bad Json in Python

I have JSON in a file which I want to access in my Python code. The JSON file looks like:
{
    "fc1" : {
        region : "Delhi",
        marketplace : "IN"
    },
    "fc2" : {
        region : "Rajasthan",
        marketplace : "IN"
    }
}
I want to use the above JSON in my Python code, accessing the values by their keys ("fc1", "fc2").
Since this is not actual JSON, I am facing difficulty accessing the values.
Is there any way in Python to access this type of JSON?
Thanks.
I agree with the comment that, if you generated that file, then you should put quotes around region and marketplace when generating it (or have the person who generated it do the same). However, if this absolutely isn't an option for whatever reason, the following approach might work:
import json
data_string = """
{
"fc1":{
region:"Delhi",
marketplace: "IN"
},
"fc2" : {
region:"Rajasthan",
marketplace: "IN"
}
}
"""
data = json.loads(data_string.replace('region', '"region"').replace('marketplace', '"marketplace"'))
data
>>>{'fc1': {'region': 'Delhi', 'marketplace': 'IN'},
'fc2': {'region': 'Rajasthan', 'marketplace': 'IN'}}
Note that you would have to do the same for any unquoted key.
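If the file has many different unquoted keys, a hedged generalization of the same idea is to quote any bare identifier in key position with a regex (assuming keys are simple identifiers and no values contain text that looks like a key):
import json
import re

# Quote any bare identifier that sits in key position (after '{' or ',', before ':')
def quote_bare_keys(text):
    return re.sub(r'([{,]\s*)([A-Za-z_]\w*)(\s*:)', r'\1"\2"\3', text)

data = json.loads(quote_bare_keys(data_string))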
There is a module, dirtyjson, which reads this incorrect JSON.
import dirtyjson
data_string = """
{
"fc1":{
region:"Delhi",
marketplace: "IN"
},
"fc2" : {
region:"Rajasthan",
marketplace: "IN"
}
}
"""
data = dirtyjson.loads(data_string)
print(data)
print(data['fc1'])
print(data['fc2'])

Change the font of an entire document without affecting formatting using Google Docs API

I am trying to change the font of an entire Google Doc using the API. The purpose is to let users of our application export documents with their company’s font.
This is what I am currently doing:
from googleapiclient.discovery import build

doc_service = build("docs", "v1")
document = doc_service.documents().get(documentId="[Document ID]").execute()
requests = []
for element in document["body"]["content"]:
    if "sectionBreak" in element:
        continue  # Trying to change the font of a section break causes an error
    requests.append(
        {
            "updateTextStyle": {
                "range": {
                    "startIndex": element["startIndex"],
                    "endIndex": element["endIndex"],
                },
                "textStyle": {
                    "weightedFontFamily": {
                        "fontFamily": "[Font name]"
                    },
                },
                "fields": "weightedFontFamily",
            }
        }
    )
doc_service.documents().batchUpdate(
    documentId=self.copy_id, body={"requests": requests}
).execute()
The code above changes the font, but it also removes any bold text formatting because it overrides the entire style of an element. Some options I have looked into:
DocumentStyle
Documents have a DocumentStyle property, but it does not contain any font information.
NamedStyles
Documents also have a NamedStyles property. It contains styles like NORMAL_TEXT and HEADING_1. I could loop through all these and change their textStyle.weightedFontFamily. This would be the ideal solution, because it would keep style information where it belongs. But I have not found a way to change NamedStyles using the API.
Deeper loop
I could continue with my current approach, looping through the elements list on each element, keeping everything but the font from textStyle (which contains things like bold: true). However, our current approach already takes too long to execute, and such an approach would be both slower and more brittle, so I would like to avoid this.
Answer:
Extract the textStyle out of the current element and only change/add the weightedFontFamily/fontFamily object.
Code Example:
requests = []
for element in document["body"]["content"]:
    if "sectionBreak" in element:
        continue  # Trying to change the font of a section break causes an error
    textStyle = element["paragraph"]["elements"][0]["textStyle"]
    # setdefault avoids a KeyError when the run has no font set yet
    textStyle.setdefault("weightedFontFamily", {})["fontFamily"] = "[Font name]"
    requests.append(
        {
            "updateTextStyle": {
                "range": {
                    "startIndex": element["startIndex"],
                    "endIndex": element["endIndex"],
                },
                "textStyle": textStyle,
                "fields": "weightedFontFamily",
            }
        }
    )
doc_service.documents().batchUpdate(
    documentId=self.copy_id, body={"requests": requests}
).execute()
This seems to work for me, even with section breaks in between and at the end of the document. You might want to explore more corner cases.
This basically tries to mimic the Select All option:
document = service.documents().get(documentId=doc_id).execute()
endIndex = sorted(
    document["body"]["content"], key=lambda x: x["endIndex"], reverse=True
)[0]["endIndex"]
service.documents().batchUpdate(
    documentId=doc_id,
    body={
        "requests": [
            {
                "updateTextStyle": {
                    "range": {
                        "startIndex": 1,
                        "endIndex": endIndex,
                    },
                    "fields": "fontSize",
                    "textStyle": {"fontSize": {"magnitude": 100, "unit": "PT"}},
                }
            }
        ]
    },
).execute()
Same should work for other fields too.
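For the font question specifically, presumably the same select-all request works with weightedFontFamily in place of fontSize; a sketch (untested, "[Font name]" is a placeholder):
service.documents().batchUpdate(
    documentId=doc_id,
    body={
        "requests": [
            {
                "updateTextStyle": {
                    "range": {"startIndex": 1, "endIndex": endIndex},
                    "fields": "weightedFontFamily",
                    "textStyle": {"weightedFontFamily": {"fontFamily": "[Font name]"}},
                }
            }
        ]
    },
).execute()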
However, if you are just going to share a DOCX file with all the clients, you could keep a local copy of the PDF/DOCX and then modify those. It is fairly easy to work with the styles in a DOCX (it is a bunch of XML files).
Use the OOXML Tools Chrome Extension to explore and update DOCX files.
Similarly, PDFs are key-value pairs stored as records. Check ReportLab.
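To see the "bunch of XML files" for yourself, a DOCX can be opened with Python's zipfile module (the file name here is hypothetical):
import zipfile

# A .docx file is a ZIP archive of OOXML parts
with zipfile.ZipFile("report.docx") as docx:
    print(docx.namelist())  # includes word/document.xml, word/styles.xml, ...
    print(docx.read("word/document.xml")[:200])  # raw OOXML markup (bytes)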

How does the Python JSON library deal with time?

So I'm currently learning MongoDB, and I'm using PyMongo rather than the MongoDB shell.
When I started trying the basic CRUD operations, I found it hard to load the bios data using PyMongo, since the original data posted on the website uses a strange ISODate format for times.
Python's built-in JSON library does not seem to support this, and mongoimport does not seem to either (not sure). But I found that after modifying it into {$date:"2017-04-01T05:00:00Z"}, mongoimport worked.
Right now I'm using subprocess to call an external command to import the data. So my question is: how do I correctly read the JSON data in Python and insert it using PyMongo?
Details
The bios data in the MongoDB documentation looks like this:
{
    "_id" : 1,
    "name" : {
        "first" : "John",
        "last" : "Backus"
    },
    "birth" : ISODate("1924-12-03T05:00:00Z"),
    "death" : ISODate("2007-03-17T04:00:00Z"),
    "contribs" : [
        "Fortran",
        "ALGOL",
        "Backus-Naur Form",
        "FP"
    ],
    "awards" : [
        {
            "award" : "W.W. McDowell Award",
            "year" : 1967,
            "by" : "IEEE Computer Society"
        },
        {
            "award" : "National Medal of Science",
            "year" : 1975,
            "by" : "National Science Foundation"
        },
        {
            "award" : "Turing Award",
            "year" : 1977,
            "by" : "ACM"
        },
        {
            "award" : "Draper Prize",
            "year" : 1993,
            "by" : "National Academy of Engineering"
        }
    ]
}
And when I try to parse it with Python's JSON library, I get a json.decoder.JSONDecodeError because of the line "birth" : ISODate("1924-12-03T05:00:00Z"),. mongoimport cannot parse this for the same reason.
When I modified
"birth" : ISODate("1924-12-03T05:00:00Z"), into
"birth" : $date:"2017-04-01T05:00:00Z"
mongoimport worked, but Python still wasn't able to parse it.
What I am asking for here is a way to deal with this problem within Python and PyMongo rather than calling external commands.
The example that you're looking at was probably intended to be used within the mongo shell, where the ISODate BSON type can be parsed as shown.
Outside of that, we have the challenge that JSON does not have a date datatype, nor does it have a standard way of representing dates. To deal with this challenge, MongoDB created something called Extended JSON, which can encode dates in JSON similar to how you have shown with $date.
In order to work with Extended JSON in Python / PyMongo, you could use json_util.
Here's a brief example:
from bson.json_util import loads
from pymongo import MongoClient

json_string = '''
{
    "_id" : 1,
    "name" : {
        "first" : "John",
        "last" : "Backus"
    },
    "birth" : {"$date":"2017-04-01T05:00:00.000Z"},
    "death" : {"$date":"2017-04-01T05:00:00.000Z"}
}
'''
# json_util.loads converts {"$date": ...} into datetime.datetime values
document = loads(json_string)
print(document)

db = MongoClient().test
collection = db.bios
# insert() is deprecated (removed in PyMongo 4); use insert_one()
collection.insert_one(document)
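Going the other way, bson.json_util also provides dumps, which serializes documents read back from MongoDB (datetime values included) into Extended JSON again. A brief sketch using the collection from above:
from bson.json_util import dumps

doc = collection.find_one({"_id": 1})
print(dumps(doc))  # datetimes are rendered back as {"$date": ...}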

How do I generate python class source code from JSON? [duplicate]

Is there a python library for converting a JSON schema to a python class definition, similar to jsonschema2pojo -- https://github.com/joelittlejohn/jsonschema2pojo -- for Java?
So far the closest thing I've been able to find is warlock, which advertises this workflow:
Build your schema
>>> schema = {
'name': 'Country',
'properties': {
'name': {'type': 'string'},
'abbreviation': {'type': 'string'},
},
'additionalProperties': False,
}
Create a model
>>> import warlock
>>> Country = warlock.model_factory(schema)
Create an object using your model
>>> sweden = Country(name='Sweden', abbreviation='SE')
However, it's not quite that easy. The objects that Warlock produces lack much in the way of introspectible goodies. And if it supports nested dicts at initialization, I was unable to figure out how to make them work.
To give a little background, the problem that I was working on was how to take Chrome's JSONSchema API and produce a tree of request generators and response handlers. Warlock doesn't seem too far off the mark, the only downside is that meta-classes in Python can't really be turned into 'code'.
Other useful modules to look for:
jsonschema - (which Warlock is built on top of)
valideer - similar to jsonschema but with a worse name.
bunch - An interesting structure builder that's half-way between a dotdict and construct
If you end up finding a good one-stop solution for this, please follow up on your question - I'd love to find one. I pored through GitHub, PyPI, Google Code, SourceForge, etc., and just couldn't find anything really sexy.
For lack of any pre-made solutions, I'll probably cobble together something with Warlock myself. So if I beat you to it, I'll update my answer. :p
python-jsonschema-objects is an alternative to warlock, built on top of jsonschema.
python-jsonschema-objects provides an automatic class-based binding to JSON schemas for use in python.
Usage:
Sample JSON Schema
schema = '''{
    "title": "Example Schema",
    "type": "object",
    "properties": {
        "firstName": {
            "type": "string"
        },
        "lastName": {
            "type": "string"
        },
        "age": {
            "description": "Age in years",
            "type": "integer",
            "minimum": 0
        },
        "dogs": {
            "type": "array",
            "items": {"type": "string"},
            "maxItems": 4
        },
        "gender": {
            "type": "string",
            "enum": ["male", "female"]
        },
        "deceased": {
            "enum": ["yes", "no", 1, 0, "true", "false"]
        }
    },
    "required": ["firstName", "lastName"]
}'''
Converting the schema object to a class:
import python_jsonschema_objects as pjs
import json
schema = json.loads(schema)
builder = pjs.ObjectBuilder(schema)
ns = builder.build_classes()
Person = ns.ExampleSchema
james = Person(firstName="James", lastName="Bond")
>>> james.lastName
u'Bond'
>>> james
<example_schema lastName=Bond age=None firstName=James>
Validation:
>>> james.age = -2
python_jsonschema_objects.validators.ValidationError: -2 was less or equal to than 0
But the problem is that it still uses draft-4 validation while jsonschema has moved on from draft 4; I filed an issue on the repo regarding this.
Unless you are using an old version of jsonschema, the above package will work as shown.
I just created this small project to generate code classes from JSON schema; even though it deals with Python, I think it can be useful when working on business projects:
pip install jsonschema2popo
Running the following command will generate a Python module containing JSON-schema-defined classes (it uses jinja2 templating):
jsonschema2popo -o /path/to/output_file.py /path/to/json_schema.json
more info at: https://github.com/frx08/jsonschema2popo

Python - how to avoid exec for batching?

I have an existing python application (limited deployment) that requires the ability to run batches/macros (ie do foo 3 times, change x, do y). Currently I have this implemented as exec running through a text file which contains simple python code to do all the required batching.
However, exec is messy (ie security issues) and there are also some cases where it doesn't act exactly the same as having the same code in your file. How can I get around using exec? I don't want to write my own mini-macro language, and users need to use multiple different macros per session, so I can't set it up such that the macro is a python file that calls the software and then runs itself, or something similar.
Is there a cleaner/better way to do this?
Pseudocode: In the software it has something like:
# when a macro gets called
for line in macrofile:
    exec line
and the macrofiles are python, ie something like:
property_of_software_obj = "some str"
software_function(some args)
etc.
Have you considered using a serialized data format like JSON? It's lightweight, can easily translate to Python dictionaries, and all the cool kids are using it.
You could construct the data in a way that is meaningful, but doesn't require containing actual code. You could then read in that construct, grab the parts you want, and then pass it to a function or class.
Edit: Added a pass at a cheesy example of a possible JSON spec.
Your JSON:
{
    "macros": [
        {
            "function": "foo_func",
            "args": {
                "x": "y",
                "bar": null
            },
            "name": "foo",
            "iterations": 3
        },
        {
            "function": "bar_func",
            "args": {
                "x": "y",
                "bar": null
            },
            "name": "bar",
            "iterations": 1
        }
    ]
}
Then you parse it with Python's json lib:
import json

# Get JSON data from elsewhere and parse it
macros = json.loads(json_data)

# Do something with each macro (iterate the "macros" list, not the top-level dict)
for macro in macros["macros"]:
    run_macro(macro)  # For example
And the resulting Python data is almost identical syntactically to JSON aside from some of the keywords like True, False, None (true, false, null in JSON).
{
    'macros': [
        {
            'args': {
                'bar': None,
                'x': 'y'
            },
            'function': 'foo_func',
            'iterations': 3,
            'name': 'foo'
        },
        {
            'args': {
                'bar': None,
                'x': 'y'
            },
            'function': 'bar_func',
            'iterations': 1,
            'name': 'bar'
        }
    ]
}
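The run_macro function above is left undefined; here is one possible sketch, assuming a whitelist dict mapping macro names to real functions (foo_func and bar_func are the hypothetical names from the example JSON):
def foo_func(x=None, bar=None):
    print('foo_func called with', x, bar)

def bar_func(x=None, bar=None):
    print('bar_func called with', x, bar)

# Unlike exec, only whitelisted functions can ever be invoked by a macro
ALLOWED_FUNCTIONS = {'foo_func': foo_func, 'bar_func': bar_func}

def run_macro(macro):
    func = ALLOWED_FUNCTIONS[macro['function']]  # KeyError for unknown names
    for _ in range(macro['iterations']):
        func(**macro['args'])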
