Storing dynamic user generated fields in mongodb - python

I'm storing data in a collection for rules. These rules are user defined and can vary, and they will be used to check data from an API. I.e. a user may define that if the field spend (taken from an API) is higher than x do y (i.e. notify Slack).
Each "campaign" can have different rules, and users can set these based on a handful parameters, with a logic comparison to a defined value. A campaign can have many rules, and rules can be used across many campaigns.
It's for an internal tool written in python 3.5
My first intuition was something like:
`{
"rulename" : "test123",
"campaigns" : [
"123456",
"765434"
],
"triggers" : [
{
"impressions": "500",
"comparison": ">"
},
{
"cost": "1.5",
"comparison": ">"
}
],
"action" : "notify"
}`
Here's a picture of user input to better illustrate: https://www.dropbox.com/s/ai745inl2quwdh8/Screenshot%202017-07-09%2022.50.20.png?dl=0
The rules will be used with api requests, and if the rules are triggered (i.e. the API says impression are above 500 AND cost is higher than 1.5, then do something.
I hope it makes sense. Thanks in advance.

Related

Set context from custom XBRL file

I'm able to read a custom XBRL file. The problem is that the parsed object has the amounts of the initial period (last december) and not the last accountable period.
from xbrl import XBRLParser, GAAP, GAAPSerializer
# xbrl comes from python-xbrl package
xbrl_parser = XBRLParser()
with open('filename.xbrl') as file:
xbrl = xbrl_parser.parse(file)
custom_obj = xbrl_parser.parseCustom(xbrl)
print(custom_obj.cashandcashequivalents)
This prints the cash of 2021/12 not 2022/06 as expected
Current output: 100545101000
Expected: 81518021000
I think those number are the ones you can see in lines 9970 and 9972 of xbrl file.
These are the lines:
9970: <ifrs-full:CashAndCashEquivalents decimals="-3" contextRef="CierreTrimestreActual" unitRef="CLP">81518021000</ifrs-full:CashAndCashEquivalents>
9972: <ifrs-full:CashAndCashEquivalents decimals="-3" contextRef="SaldoActualInicio" unitRef="CLP">100545101000</ifrs-full:CashAndCashEquivalents>
How can I set the context/contextRef so the custom_obj has the numbers of the latest periods?
XBRL file: https://www.cmfchile.cl/institucional/inc/inf_financiera/ifrs/safec_ifrs_verarchivo.php?auth=&send=&rut=70016160&mm=06&aa=2022&archivo=70016160_202206_C.zip&desc_archivo=Estados%20financieros%20(XBRL)&tipo_archivo=XBRL
I've never used python-xbrl, but from a quick look at the source code it looks very basic and makes lots of unwarranted assumptions about the structure of the document. It doesn't appear to have any support for XBRL Dimensions, which the report you're using makes use of.
The module isn't built on a proper model of the XBRL data which would give you easy access to each fact's properties such as the period, and allow you to easily filter down to just the facts that you want.
I don't think the module will allow you to do what you want. Looking at this code it just iterates over all the facts, and sticks them onto properties on an object, so whichever fact it hits last in the document will be the one that you get, and given that order isn't important in XBRL files, it's just going to be pot luck which one you get.
I'd strongly recommend switching to a better XBRL library. Arelle is probably the most widely used, although you could also use my own pxp.
As an example, either tool can be used to convert the XBRL to JSON format, and will give you facts like this:
"f126928": {
"value": "81518021000",
"decimals": -3,
"dimensions": {
"concept": "ifrs-full:CashAndCashEquivalents",
"entity": "scheme:70016160-9",
"period": "2022-07-01T00:00:00",
"unit": "iso4217:CLP"
}
},
"f126930": {
"value": "100545101000",
"decimals": -3,
"dimensions": {
"concept": "ifrs-full:CashAndCashEquivalents",
"entity": "scheme:70016160-9",
"period": "2022-01-01T00:00:00",
"unit": "iso4217:CLP"
}
},
With this, you can then sort the facts by period, and then select the most recent one. Of course, you can do the same directly via the Python interfaces in these tools, rather than going via JSON.

How to ensure all data is captured in ES API?

I am trying to create an API in Python to pull the data from ES and feed it in a data warehouse. The data is live and being filled every second so I am going to create a near-real-time pipeline.
The current URL format is {{url}}/{{index}}/_search and the test payload I am sending is:
{
"from" : 0,
"size" : 5
}
On the next refresh it will pull using payload:
{
"from" : 6,
"size" : 5
}
And so on until it reaches the total amount of records. The PROD environment has about 250M rows and I'll set the size to 10 K per extract.
I am worried though as I don't know if the records are being reordered within ES. Currently, there is a plugin which uses a timestamp generated by the user but that is flawed as sometimes documents are being skipped due to a delay in the jsons being made available for extract in ES and possibly the way the time is being generated.
Does anyone know what is the default sorting when pulling the data using /_search?
I suppose what you're looking for is a streaming / changes API which is nicely described by #Val here and also an open feature request.
I the meantime, you cannot really count on the size and from parameters -- you could probably make redundant queries and handle the duplicates before they reach your data warehouse.
Another option would be to skip ES in this regard and stream directly to the warehouse? What I mean is, take an ES snapshot up until a given time once (so you keep the historical data), feed it to the warehouse, and then stream directly from where ever you're getting your data to the warehouse.
Addendum
AFAIK the default sorting is by the insertion date. But there's no internal _insertTime or similar.. You can use cursors -- it's called scrolling and here's a py implementation. But this goes from the 'latest' doc to the 'first', not vice versa. So it'll give you all the existing docs but I'm not so sure about the newly added ones while you were scrolling. You'd then wanna run the scroll again which is suboptimal.
You could also pre-sort your index which should work quite nicely for your use case when combined w/ scrolling.
Thanks for the responses. After consideration with my colleagues, we decided to implement and use the _ingest API instead to create a pipeline in ES which inserts the server document ingestion date on each doc.
Steps:
Create the timestamp pipeline
PUT _ingest/pipeline/timestamp_pipeline
{
"description" : "Inserts timestamp field for all documents",
"processors" : [
{
"set" : {
"field": "insert_date",
"value": "{{_ingest.timestamp}}"
}
}
]
}
Update indexes to add the new default field
PUT /*/_settings
{
"index" : {
"default_pipeline": "timestamp_pipeline"
}
}
In Python then I use the _scroll API like so:
es = Elasticsearch(cfg.esUrl, port = cfg.esPort, timeout = 200)
doc = {
"query": {
"range": {
"insert_date": {
"gte": lastRowDateOffset
}
}
}
}
res = es.search(
index = Index,
sort = "insert_date:asc",
scroll = "2m",
size = NumberOfResultsPerPage,
body = doc
)
Where lastRowDateOffset is the date of the last run

Google Docs API programmatically adding a table of content

I have a python script which does some analysis and output the results as text (paragraphs) on a Google Doc. I know how to insert text, update paragraph and text style through batchUpdate.
doc_service.documents().batchUpdate(documentId=<ID>,body={'requests': <my_request>}).execute()
where, for instance, "my_request" takes the form of something like:
request = [
{
"insertText": {
"location": {
"index": <index_position>,
"segmentId": <id>
},
"text": <text>
}
},
{
"updateParagraphStyle": {
"paragraphStyle": {
"namedStyleType": <paragraph_type>
},
"range": {
"segmentId": <id>,
"startIndex": <index_position>,
"endIndex": <index_position>
},
"fields": "namedStyleType"
}
},
]
However, once the script is done updating the table, it would be fantastic if a table of content could be added at the top of the document.
However, I am very new to Google Docs API and I am not entirely sure how to do that. I know I should use "TableOfContents" as a StructuralElement. I also know this option currently does not update automatically after each modification brought to the document (this is why I would like to create it AFTER the document has finished updating and place it at the top of the document).
How to do this with python? I am unclear where to call "TableOfContents" in my request.
Thank you so very much!
After your comment, I was able to understand better what you are desiring to do, but I came across these two Issue Tracker's posts:
Add the ability to generate and update the TOC of a doc.
Geting a link to heading paragraph.
These are well-known feature requests that unfortunately haven't been implemented yet. You can hit the ☆ next to the issue number in the top left on this page as it lets Google know more people are encountering this and so it is more likely to be seen faster.
Therefore, it's not possible to insert/update a table of contents programmatically.

Scrapy to crawl LD JSON data

I have done some research but can't seem to find any information on if it is possible to crawl something like JSON Schema data from a URL. An example i just found as i was looking at the product anyway would be:
https://www.reevoo.com/p/panasonic-nn-e271wmbpq
<script class="microdata-snippet" type="application/ld+json">
{
"#context": "http://schema.org/",
"#type": "Product",
"name": "PANASONIC NN-E271WMBPQ",
"image": "https://images.reevoo.com/products/3530/3530797/550x550.jpg?fingerprint=73ed91807dac7eb8f899757a348c735446d0a1fe&gravity=Center"
,"category": {
"#type": "Thing",
"name": "Microwave",
"url": "https://www.reevoo.com/browse/product_type/microwaves"
}
,"description": "Auto weight programs will automatically calculate the cooking time, once the weight has been entered. Acrylic lining makes cleaning easy, simply wipe after use. Child lock provides extra security to prevent little fingers interfering with the programming of the oven. \nAll our compact microwave ovens are packed with flexible features to make everyday cooking simple. Auto weight programs will automatically calculate the cooking time, once the weight has been entered. Acrylic lining makes cleaning easy, simply wipe after use. Child lock provides extra security to prevent little fingers interfering with the programming of the oven."
,"aggregateRating": {
"#type": "AggregateRating",
"ratingValue": "8.7",
"ratingCount": 636,
"worstRating": "1",
"bestRating": "10"
}
}
</script>
So would it be possible to extract say the rating data?
Thanks in advance,
import json
And next in your code:
microdata_content = response.xpath('//script[#type="application/ld+json"]/text()').extract_first()
microdata = json.loads(microdata_content)
ratingValue = microdata["aggregateRating"]["ratingValue"]

Validate object values against yaml configuration

I have an application where a nested Python dictionary is created based on a JSON document that I get as a response from an API. Example:
colleagues = [
{ "name": "John",
"skills": ["python", "java", "scala"],
"job": "developer"
},
{ "name": "George",
"skills": ["c", "go", "nodejs"],
"job": "developer"
}]
This dictionary can have many more nested levels.
What I want to do is let the user define their own arbitrary conditions (e.g. in order to find colleagues that have "python" among their skills, or whose name is "John") in a YAML configuration file, which I will use to check against the Python dictionary.
I thought about letting them configure that in the following manner in the YAML file, but this would require using exec(), which I want to avoid for security reasons:
constraints:
- "python" in colleagues[x]["skills"]
- colleagues[x]["name"] == "John"
What other options are there for such a problem, so that the user can specify their own constraints for the dictionary values? Again, the dictionary above is just an example. The actual one is much larger in size and nesting levels.
You could use a Lucene query parser to convert queries like "skill:python" and "name:John" to executable predicate functions, and then filter your list of colleagues using those predicates. Googling for "python lucene parser" will turn up several parsing options.

Categories

Resources