Set context from custom XBRL file - python

I'm able to read a custom XBRL file. The problem is that the parsed object contains the amounts from the initial period (last December) rather than the latest accounting period.
from xbrl import XBRLParser, GAAP, GAAPSerializer
# xbrl comes from the python-xbrl package

xbrl_parser = XBRLParser()
with open('filename.xbrl') as file:
    xbrl = xbrl_parser.parse(file)
custom_obj = xbrl_parser.parseCustom(xbrl)
print(custom_obj.cashandcashequivalents)
This prints the cash as of 2021/12, not 2022/06 as expected.
Current output: 100545101000
Expected: 81518021000
I think those numbers are the ones you can see on lines 9970 and 9972 of the XBRL file. These are the lines:
9970: <ifrs-full:CashAndCashEquivalents decimals="-3" contextRef="CierreTrimestreActual" unitRef="CLP">81518021000</ifrs-full:CashAndCashEquivalents>
9972: <ifrs-full:CashAndCashEquivalents decimals="-3" contextRef="SaldoActualInicio" unitRef="CLP">100545101000</ifrs-full:CashAndCashEquivalents>
How can I set the context/contextRef so the custom_obj has the numbers of the latest periods?
XBRL file: https://www.cmfchile.cl/institucional/inc/inf_financiera/ifrs/safec_ifrs_verarchivo.php?auth=&send=&rut=70016160&mm=06&aa=2022&archivo=70016160_202206_C.zip&desc_archivo=Estados%20financieros%20(XBRL)&tipo_archivo=XBRL

I've never used python-xbrl, but from a quick look at the source code it looks very basic and makes lots of unwarranted assumptions about the structure of the document. It doesn't appear to have any support for XBRL Dimensions, which the report you're using makes use of.
The module isn't built on a proper model of the XBRL data which would give you easy access to each fact's properties such as the period, and allow you to easily filter down to just the facts that you want.
I don't think the module will allow you to do what you want. Looking at the code, it just iterates over all the facts and sticks them onto properties of an object, so whichever fact it hits last in the document is the one you get. Given that order isn't significant in XBRL files, it's just pot luck which one that is.
I'd strongly recommend switching to a better XBRL library. Arelle is probably the most widely used, although you could also use my own pxp.
As an example, either tool can be used to convert the XBRL to JSON format, and will give you facts like this:
"f126928": {
"value": "81518021000",
"decimals": -3,
"dimensions": {
"concept": "ifrs-full:CashAndCashEquivalents",
"entity": "scheme:70016160-9",
"period": "2022-07-01T00:00:00",
"unit": "iso4217:CLP"
}
},
"f126930": {
"value": "100545101000",
"decimals": -3,
"dimensions": {
"concept": "ifrs-full:CashAndCashEquivalents",
"entity": "scheme:70016160-9",
"period": "2022-01-01T00:00:00",
"unit": "iso4217:CLP"
}
},
With this, you can then sort the facts by period, and then select the most recent one. Of course, you can do the same directly via the Python interfaces in these tools, rather than going via JSON.
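For instance, once you have the report in xBRL-JSON form, picking the most recent CashAndCashEquivalents fact takes only a few lines of standard-library Python. This is a minimal sketch, not tied to Arelle's or pxp's own Python API; "report.json" is a placeholder filename, and the facts are assumed to sit under a top-level "facts" key as in the example above.

import json

# Load the xBRL-JSON output produced by Arelle or pxp ("report.json" is a placeholder).
with open("report.json") as f:
    report = json.load(f)

# Keep only the CashAndCashEquivalents facts that carry a period.
cash_facts = [
    fact for fact in report["facts"].values()
    if fact["dimensions"].get("concept") == "ifrs-full:CashAndCashEquivalents"
    and "period" in fact["dimensions"]
]

# ISO 8601 periods sort correctly as strings, so max() gives the latest one.
latest = max(cash_facts, key=lambda fact: fact["dimensions"]["period"])
print(latest["value"])  # expected: 81518021000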

Related

How can I transform or query JSON data in my Github Workflow?

I am using the octokit/request-action action in my Github Workflow to obtain information about a certain Github Release via its tag name:
- name: Get Release Info
  uses: octokit/request-action@v2.x
  id: release
  with:
    route: GET /repos/{org_repo}/releases/tags/{tag}
The response JSON structure contains information about all of the assets attached to the release. I need to use this information to build a list of download URLs for all the release artifacts and pass that into a Python script that builds a discord notification.
Using the Github Workflow YAML, how do I "query" the JSON data (similar to JSONata) to obtain a list of all the browser_download_url values inside the assets array? The data returned looks like this (trimmed):
{
  "url": "https://api.github.com/repos/octocat/Hello-World/releases/1",
  "id": 1,
  "assets": [
    {
      "url": "https://api.github.com/repos/octocat/Hello-World/releases/assets/1",
      "browser_download_url": "https://github.com/octocat/Hello-World/releases/download/v1.0.0/example1.zip",
      "id": 1
    },
    {
      "url": "https://api.github.com/repos/octocat/Hello-World/releases/assets/2",
      "browser_download_url": "https://github.com/octocat/Hello-World/releases/download/v1.0.0/example2.zip",
      "id": 2
    }
  ]
}
The end result I want is a way to pass the two download URLs above to my script like so (using a separate step in my workflow):
python discord_notification.py "https://github.com/octocat/Hello-World/releases/download/v1.0.0/example1.zip" "https://github.com/octocat/Hello-World/releases/download/v1.0.0/example2.zip"
(Exact syntax can vary; the above snippet is just an example)
It's possible that what I want to do just can't be achieved in the workflow YAML itself. If that's the case, I'd be OK with a solution that involves passing all or part of the response JSON to the Python script and using Python itself to parse the JSON data. I just don't know if Bash adds a layer of complexity that will make passing a multi-line response string as a parameter difficult.
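Since the question leaves room for the fallback of handing the response JSON to Python and parsing it there, here is a minimal sketch of that route. The helper itself and its argument handling are illustrative assumptions, not an existing part of the workflow.

import json
import sys

# Sketch: the octokit/request-action response JSON is assumed to be passed
# to this helper as its first command-line argument.
release = json.loads(sys.argv[1])

# Collect every asset's browser_download_url.
urls = [asset["browser_download_url"] for asset in release.get("assets", [])]

# Print them space-separated so a later workflow step can pass them straight
# to discord_notification.py as arguments.
print(" ".join(urls))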

Apache Spark / PySpark, defining custom JSON Schema for Dynamic Keys

I have a bunch of JSON files, and suppose each have the following structure:
{
  "fields": {
    "name": "Bob",
    "key": "bob"
  },
  "results": {
    "bob": { ... }
  }
}
Where, for some unfortunate reason, while the structure of the JSON is fairly consistent, there is one dynamic key under "results". Defining the schema under "fields" is fairly straightforward to me.
So, for several JSON files, the final schema might be:
fieldSchema = StructField(...)
resultSchema = StructField("results", StructType([StructField("bob", ...)]))
finalSchema = StructType([fieldSchema, resultSchema])
Where the problem is this line: StructField("bob", ...)
Obviously, bob is not the key I'm looking for. This name for the StructField would ideally be some kind of wildcard character, regex pattern, or worst case, some dynamic field based on other fields.
I'm a newbie to Spark and have been scouring the documentation and historical StackOverflow posts, but I've been unable to find anything.
Long story short, I want to be able to pass some kind of wide net for the name parameter in StructField to encompass a variety of different keys, similar to a regex pattern.
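One common workaround worth noting (not from the original thread): if the key under "results" is dynamic but the value beneath it has a stable shape, "results" can be modelled as a MapType rather than a StructType, so the dynamic key simply becomes a map key. A minimal sketch, with the inner value schema and input path as placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.types import MapType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Placeholder schema for whatever sits under the dynamic key;
# replace these fields with the real structure of results["<key>"].
resultValueSchema = StructType([
    StructField("score", StringType(), True),
])

schema = StructType([
    StructField("fields", StructType([
        StructField("name", StringType(), True),
        StructField("key", StringType(), True),
    ]), True),
    # MapType accepts any key name ("bob", "alice", ...).
    StructField("results", MapType(StringType(), resultValueSchema), True),
])

df = spark.read.schema(schema).json("path/to/json/files")  # placeholder path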

Google Docs API programmatically adding a table of content

I have a Python script which does some analysis and outputs the results as text (paragraphs) to a Google Doc. I know how to insert text and update paragraph and text styles through batchUpdate.
doc_service.documents().batchUpdate(documentId=<ID>,body={'requests': <my_request>}).execute()
where, for instance, "my_request" takes the form of something like:
request = [
    {
        "insertText": {
            "location": {
                "index": <index_position>,
                "segmentId": <id>
            },
            "text": <text>
        }
    },
    {
        "updateParagraphStyle": {
            "paragraphStyle": {
                "namedStyleType": <paragraph_type>
            },
            "range": {
                "segmentId": <id>,
                "startIndex": <index_position>,
                "endIndex": <index_position>
            },
            "fields": "namedStyleType"
        }
    },
]
However, once the script is done with its updates, it would be fantastic if a table of contents could be added at the top of the document.
I am very new to the Google Docs API and I am not entirely sure how to do that. I know I should use "TableOfContents" as a StructuralElement. I also know this option currently does not update automatically after each modification to the document (which is why I would like to create it AFTER the document has finished updating and place it at the top).
How can I do this with Python? I am unclear about where to call "TableOfContents" in my request.
Thank you so very much!
After your comment, I was able to better understand what you are trying to do, but I came across these two Issue Tracker posts:
Add the ability to generate and update the TOC of a doc.
Getting a link to a heading paragraph.
These are well-known feature requests that unfortunately haven't been implemented yet. You can hit the ☆ next to the issue number in the top left on this page as it lets Google know more people are encountering this and so it is more likely to be seen faster.
Therefore, it's not possible to insert/update a table of contents programmatically.

Validate object values against yaml configuration

I have an application where a nested Python dictionary is created based on a JSON document that I get as a response from an API. Example:
colleagues = [
    {
        "name": "John",
        "skills": ["python", "java", "scala"],
        "job": "developer"
    },
    {
        "name": "George",
        "skills": ["c", "go", "nodejs"],
        "job": "developer"
    }
]
This dictionary can have many more nested levels.
What I want to do is let the user define their own arbitrary conditions (e.g. in order to find colleagues that have "python" among their skills, or whose name is "John") in a YAML configuration file, which I will use to check against the Python dictionary.
I thought about letting them configure that in the following manner in the YAML file, but this would require using exec(), which I want to avoid for security reasons:
constraints:
- "python" in colleagues[x]["skills"]
- colleagues[x]["name"] == "John"
What other options are there for such a problem, so that the user can specify their own constraints for the dictionary values? Again, the dictionary above is just an example. The actual one is much larger in size and nesting levels.
You could use a Lucene query parser to convert queries like "skill:python" and "name:John" to executable predicate functions, and then filter your list of colleagues using those predicates. Googling for "python lucene parser" will turn up several parsing options.
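To illustrate the predicate-filter idea without committing to a particular Lucene parser, here is a minimal hand-rolled sketch: constraints are written as simple field/value pairs in YAML (a deliberately simplified stand-in for a real query syntax, not the format from the question), converted into predicate functions, and applied to the colleagues list.

import yaml  # PyYAML

def make_predicate(field, expected):
    """Return a function that checks one field of a colleague record."""
    def predicate(record):
        value = record.get(field)
        if isinstance(value, list):
            return expected in value      # e.g. "python" in skills
        return value == expected          # e.g. name == "John"
    return predicate

# Simplified constraint format (an assumption for this sketch).
config = yaml.safe_load("""
constraints:
  - {field: skills, value: python}
  - {field: name, value: John}
""")

predicates = [make_predicate(c["field"], c["value"]) for c in config["constraints"]]

colleagues = [
    {"name": "John", "skills": ["python", "java", "scala"], "job": "developer"},
    {"name": "George", "skills": ["c", "go", "nodejs"], "job": "developer"},
]

# Keep colleagues matching every constraint.
matches = [c for c in colleagues if all(p(c) for p in predicates)]
print(matches)  # [{'name': 'John', ...}]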

What is the best way to search millions of JSON files?

I've very recently picked up programming in Python and am working on creating a database.
I've already worked out extracting all these files from their source so they are all in a directory on my computer.
All of these files are structured the same way and what I want to do is search these multidimensional dictionaries and locate the value for a specific set of keys.
These json files are all structured similarly,
{
  "userid": 34535367,
  "result": {
    "list": [
      {
        "name": 264,
        "age": 64,
        "id": 456345345
      },
      {
        "name": 263,
        "age": 42,
        "id": 364563463456
      }
    ]
  }
}
In my case, I would like to search for the "name" key and return the relevant data (quality, id and the original userid) for the thousands of names just like it from my millions of JSON files.
Basically I'm very new at this and the little programming knowledge I have is in Python. I'm happy to start learning whatever I need to, but I'm not sure which direction to go.
If your goal is to create a database, then you should look at how databases work and how they solve the same problem you are trying to solve right now :)
NoSQL databases (like MongoDB) also work with JSON documents and most likely implement a whole set of tools to search and filter documents.
Now to answer your question: there is no quick way to do this unless you do some preprocessing, meaning that you store extra information about the data (called metadata).
This is a huge subject and I don't have enough expertise to give you all the answers, but I can give you a simple tip: use indexes.
An index is a sorted key/value map where, for every value, we store the documents that contain that value (or the file + position of the JSON document). For example, an index for the name property would look like this:
{
    263: ('jsonfile10.json', '0'),
    264: ('jsonfile10.json', '30'),
    # The JSON document can be found in the file jsonfile10.json at line 30
}
By keeping an index for the most-queried values, you can turn a linear-time search into a logarithmic-time search, not to mention that inserting a new document is much faster. In your case, you seem to only need an index on the name field.
Creating/updating the index is done when you insert, update or remove a document. Using a balanced binary tree can accelerate the updates on the index.
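As a rough illustration of the index idea (a sketch only, with a placeholder directory and the JSON layout shown in the question), building an in-memory name index over the files might look like this:

import glob
import json
from collections import defaultdict

# Map each "name" value to the files (and userids) where it appears.
name_index = defaultdict(list)

for path in glob.glob("data/*.json"):  # placeholder directory
    with open(path) as f:
        doc = json.load(f)
    for entry in doc["result"]["list"]:
        name_index[entry["name"]].append((path, doc["userid"], entry["id"]))

# A lookup is now a dictionary access instead of a scan over millions of files.
print(name_index.get(264, []))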
As a suggestion, why don't you just process all the incoming files and insert the data into a database? You will have a toolset to query that database. SQLite for example will do (as well as any other more sophisticated database):
http://www.sqlite.org/
http://docs.python.org/2/library/sqlite3.html
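A minimal sketch of that suggestion, using only the standard library (the database file, table and column names are made up for illustration):

import glob
import json
import sqlite3

conn = sqlite3.connect("records.db")  # placeholder database file
conn.execute(
    "CREATE TABLE IF NOT EXISTS records (userid INTEGER, name INTEGER, age INTEGER, id INTEGER)"
)
conn.execute("CREATE INDEX IF NOT EXISTS idx_name ON records (name)")

# Load every JSON file once, then query the indexed table as often as needed.
for path in glob.glob("data/*.json"):  # placeholder directory
    with open(path) as f:
        doc = json.load(f)
    for entry in doc["result"]["list"]:
        conn.execute(
            "INSERT INTO records VALUES (?, ?, ?, ?)",
            (doc["userid"], entry["name"], entry["age"], entry["id"]),
        )
conn.commit()

for row in conn.execute("SELECT userid, id, age FROM records WHERE name = ?", (264,)):
    print(row)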
Another simple solution might be to build a file mapping name_id to /file/path. Then you can do a logarithmic binary search by the name id. But I'd still advise using a proper database, as maintaining the index will be more cumbersome than doing some inserts/deletes.
