Reading XBRL with Python

I am trying to find a particular tag in an XBRL file. I originally tried using the python-xbrl package, but it is not exactly what I want, so I based my code on the code available in the package.
Here's the part of the XBRL that I am interested in:
<us-gaap:LiabilitiesCurrent contextRef="eol_PE2035----1510-Q0008_STD_0_20150627_0" unitRef="iso4217_USD" decimals="-6" id="id_5025426_6FEF05CB-B19C-4D84-AAF1-79B431731049_1_24">65285000000</us-gaap:LiabilitiesCurrent>
<us-gaap:Liabilities contextRef="eol_PE2035----1510-Q0008_STD_0_20150627_0" unitRef="iso4217_USD" decimals="-6" id="id_5025426_6FEF05CB-B19C-4D84-AAF1-79B431731049_1_28">147474000000</us-gaap:Liabilities>
The python-xbrl package is based on beautifulsoup4 and several other packages.
Here is the code:
liabilities = xbrl.find_all(name=re.compile("(us-gaap:Liabilities)",
                                            re.IGNORECASE | re.MULTILINE))
I get the value for us-gaap:LiabilitiesCurrent, but I want the value for us-gaap:Liabilities.
Right now, as soon as it finds a match, it stores it. But in many cases it's the wrong match because of the tag format in XBRL. I believe I need to change the re.compile() part to make it work correctly.

I'd be very wary about using this approach to parsing XBRL (or indeed, any XML with namespaces in it). "us-gaap:Liabilities" is a QName, consisting of a prefix ("us-gaap") and a local name ("Liabilities"). The prefix is just a shorthand for a full namespace URI such as "http://fasb.org/us-gaap/2015-01-31", which is defined by a namespace declaration, usually at the top of the document. If you look at the top of the document you'll see something like:
xmlns:us-gaap="http://fasb.org/us-gaap/2015-01-31"
This means that within the scope of this document, "us-gaap" is taken to mean that full namespace URI.
XML creators are free to use whatever prefixes they want, so there is no guarantee that the element will actually be called "us-gaap:Liabilities" across all documents that you encounter.
beautifulsoup4 has very limited support for namespaces, so I wouldn't recommend it as a starting point for building an XBRL processor. It may be worth taking a look at the Arelle project, which is a full XBRL processor, and will make it easier to do other tasks such as finding the labels and other information associated with facts in the taxonomy.
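As a rough illustration of matching on the namespace URI rather than the prefix, here is a minimal sketch using only the standard library. It is not a full XBRL processor; the sample document is trimmed from the question, and the wrapping xbrl element and the short contextRef/unitRef values are placeholders I made up for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed sample; the root element and attribute values are placeholders.
SAMPLE = """<xbrl xmlns:us-gaap="http://fasb.org/us-gaap/2015-01-31">
  <us-gaap:LiabilitiesCurrent contextRef="c1" unitRef="usd" decimals="-6">65285000000</us-gaap:LiabilitiesCurrent>
  <us-gaap:Liabilities contextRef="c1" unitRef="usd" decimals="-6">147474000000</us-gaap:Liabilities>
</xbrl>"""

# The namespace URI identifies the element, whatever prefix the document uses.
US_GAAP = "http://fasb.org/us-gaap/2015-01-31"

root = ET.fromstring(SAMPLE)
# ElementTree spells a qualified name as "{namespace-uri}localname".
facts = root.findall("{%s}Liabilities" % US_GAAP)
values = [int(fact.text) for fact in facts]
print(values)
```

Because the lookup is an exact match on the qualified name, LiabilitiesCurrent is not picked up. The namespace URI changes with the taxonomy year, so a real script would have to read it from the document or try several known URIs.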

Try adding a $ at the end of the pattern. It anchors the match at the end of the tag name, so nothing may follow "Liabilities":
liabilities = xbrl.find_all(name=re.compile("(us-gaap:Liabilities$)",
                                            re.IGNORECASE | re.MULTILINE))
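To see what the anchor changes, here is a small sketch that applies both patterns to plain tag-name strings. The third name in the list is only there for illustration, and the leading ^ is my addition to also pin the start of the name.

```python
import re

tag_names = [
    "us-gaap:LiabilitiesCurrent",
    "us-gaap:Liabilities",
    "us-gaap:LiabilitiesNoncurrent",  # extra name, for illustration only
]

# Unanchored: matches any name that merely contains the text.
loose = re.compile(r"us-gaap:Liabilities", re.IGNORECASE)
# Anchored at both ends: matches only the exact name.
anchored = re.compile(r"^us-gaap:Liabilities$", re.IGNORECASE)

loose_hits = [name for name in tag_names if loose.search(name)]
anchored_hits = [name for name in tag_names if anchored.search(name)]

print(loose_hits)
print(anchored_hits)
```

The unanchored pattern accepts all three names, which is why the first match in the document can be LiabilitiesCurrent.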

Related

How to Extract Versions from Software Packages

I'm trying to extract the version number from software packages hosted on SourceForge based on this Stack Overflow post. Specifically, I'm using the Release API and the "best_release.json" call. I have the following examples:
7-zip: https://sourceforge.net/projects/sevenzip/best_release.json
KeePass: https://sourceforge.net/projects/keepass/best_release.json
OpenOffice.org: https://sourceforge.net/projects/openofficeorg.mirror/best_release.json
Using the following code snippet:
import requests
"""
Un/comment the following lines to change the project name and test
different responses.
"""
proj = "keepass"
# proj = "sevenzip"
# proj = "openofficeorg.mirror"
r = requests.get(f'https://sourceforge.net/projects/{proj}/best_release.json')
json_resp = r.json()
print(json_resp['release']['filename'])
I receive the respective results for each package:
7-Zip: /7-Zip/22.00/7z2200-linux-x86.tar.xz
KeePass: /KeePass 2.x/2.51.1/KeePass-2.51.1.zip
Openoffice.org: /extended/iso/en/OOo_3.3.0_Win_x86_install_en-US_20110219.iso
I'm wondering how I can extract the file versions from these disparate packages. Looking at the results, one can see that there are different naming conventions. For example, 7-Zip puts the file version as "22.00" in the second directory level. KeePass, however, puts it in the second directory level as well as the filename itself. OpenOffice.org puts it inside the filename.
Is there a way to do some sort of fuzzy match that can attempt to extract a "best guess" file version given a filename?
I thought of using regular expressions, re. For example, I can use the (\d+) capture group to capture one or more digits, as demonstrated here. However, this would also capture text such as "x86," which I don't want. I just desire some text that looks closest to a version number, but I'm unsure how to do this.
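One possible heuristic, offered as a sketch rather than a definitive answer: treat only dotted runs of digits as version candidates, which rules out tokens such as "x86" or "7z2200", and prefer the candidate with the most components. The function name guess_version is made up for this example, and the approach will miss versions that have no dot at all.

```python
import re

# Two or more digit groups separated by dots, e.g. "22.00" or "2.51.1".
VERSION_RE = re.compile(r"(?<!\d)\d+(?:\.\d+)+")

def guess_version(path):
    """Return the most version-like substring of path, or None."""
    candidates = VERSION_RE.findall(path)
    if not candidates:
        return None
    # Prefer the candidate with the most dot-separated components;
    # on a tie, max() keeps the first one found.
    return max(candidates, key=lambda v: v.count("."))

for filename in [
    "/7-Zip/22.00/7z2200-linux-x86.tar.xz",
    "/KeePass 2.x/2.51.1/KeePass-2.51.1.zip",
    "/extended/iso/en/OOo_3.3.0_Win_x86_install_en-US_20110219.iso",
]:
    print(filename, "->", guess_version(filename))
```

This only looks at the three filename shapes from the question; other projects may need extra rules.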

Python's ElementTree, how to create links in a paragraph

I have a website I'm building running off Python 2.7 and using ElementTree to build the HTML on the fly. I have no problem creating the elements and appending them to the tree. It's where I have to insert links in the middle of a large paragraph that I am stumped. This is easy when it's done in text, but this is doing it via XML. Here's what I mean:
Sample text:
lawLine = "..., a vessel as defined in Section 21 of the Harbors and Navigation Code which is inhabited and designed for habitation, an inhabited floating home as defined in subdivision (d) of Section 18075.55 of the Health and Safety Code, ..."
To add that text to the HTML as H4-style text, I typically use:
h4 = ET.Element('h4')
htmlTree.append(h4)
h4.text = lawLine
I need to add links at the word "Section" and the numbers associated with it, but I can't simply create a new element "a" in the middle of a paragraph and add it to the HTML tree, so I'm trying to build that piece as text, then do ET.fromstring and append it to the tree:
thisLawType = 'PC'
matches = re.findall(r'Section [0-9.]*', lawLine)
if matches:
    lawLine = """<h4>{0}</h4>""".format(lawLine)
    for thisMatch in matches:
        thisMatchLinked = """{2}""".format(thisLawType, thisMatch.replace('Section ',''), thisMatch)
        lawLine = lawLine.replace(thisMatch, thisMatchLinked)
    htmlBody.append(ET.fromstring(lawLine))
I am getting "xml.etree.ElementTree.ParseError: not well-formed" errors when I do ET.fromstring. Is there a better way to do this in ElementTree? I'm sure there are better extensions out there, but my work environment is limited to Python 2.7 and the standard library. Any help would be appreciated. Thanks!
Evan
The XML you are generating is indeed not well formed, because of the presence of & in thisMatchLinked. It's one of the special characters which need to be escaped (see an interesting explanation here).
So try replacing & with &amp; and see if it works.
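An alternative that avoids building XML as a string is to create the a elements directly and put the surrounding prose in the text and tail attributes; ElementTree then does the escaping itself. This is only a sketch: BASE_URL and the query parameter names are hypothetical, since the real link format is not shown in the question, and link_sections is a name made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical link target; the real URL pattern is not shown in the question.
BASE_URL = "https://example.com/law"

def link_sections(text, law_type):
    """Return an <h4> element in which each 'Section N' is wrapped in <a>."""
    h4 = ET.Element("h4")
    last_link = None
    pos = 0
    for m in re.finditer(r"Section ([0-9]+(?:\.[0-9]+)*)", text):
        chunk = text[pos:m.start()]
        if last_link is None:
            h4.text = chunk          # prose before the first link
        else:
            last_link.tail = chunk   # prose between two links
        href = "{0}?code={1}&section={2}".format(BASE_URL, law_type, m.group(1))
        last_link = ET.SubElement(h4, "a", href=href)
        last_link.text = m.group(0)
        pos = m.end()
    rest = text[pos:]
    if last_link is None:
        h4.text = rest
    else:
        last_link.tail = rest        # prose after the last link
    return h4

law_line = ("a vessel as defined in Section 21 of the Harbors and Navigation Code, "
            "an inhabited floating home as defined in subdivision (d) of "
            "Section 18075.55 of the Health and Safety Code")
h4 = link_sections(law_line, "PC")
html = ET.tostring(h4).decode("ascii")
print(html)
```

The returned element can be appended to the tree like any other. The raw & in the href is written out as &amp; during serialization, so no manual escaping is needed.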

XBRL label names differ between instance and calculation documents

I have what is, probably, a very stupid question, but I'm stumped by it and would appreciate any help.
I'm trying to gather xbrl data from SEC filings using Python and BeautifulSoup. One problem I'm having is that certain line items are referred to differently in the instance document and the calculation linkbase.
As a concrete example, take this recent 10-K from PHI Group Inc.:
https://www.sec.gov/Archives/edgar/data/704172/000149315221015100/0001493152-21-015100-index.htm
A line item with the xbrl tag 'WriteoffOfFinancingCosts' shows up as
<PHIL:WriteoffOfFinancingCosts ...> in the instance document (along with a value and contexts)
but shows up as 'loc_PHILWriteoffOfFinancingCosts' in the calculation linkbase.
But this relationship, 'PHIL:' = 'loc_PHIL', isn't standard across XBRL filings. How does one know what prefix will be added to a tag in the calculation linkbase so that (with the prefix removed) it can be reliably tied back to a tag in the instance document?
I can think of various workarounds, but it just seems silly; isn't there somewhere I can look in the calculation linkbase or elsewhere that will just TELL me exactly what prefix is added?
As some (possibly relevant) nuance: lots of tags in lots of filings, of course, have a prefix like 'us-gaap', indicating the us-gaap namespace, but that doesn't seem to guarantee that a tag in the calculation linkbase will therefore look like 'us-gaapAccountsPayableCurrent' and not 'loc_us-gaapAccountsPayableCurrent' or 'us-gaap:AccountsPayableCurrent' or some other variation of the basic pattern, all of which, of course, look different to BeautifulSoup.
Can anyone point me in the right direction?
PHIL:WriteoffOfFinancingCosts is the name of the XBRL concept, while loc_PHILWriteoffOfFinancingCosts is the (calculation linkbase) label of the locator pointing to the concept PHIL:WriteoffOfFinancingCosts. This mechanism is the way linkbases connect concepts together: each locator is a "proxy" to a concept.
loc_PHILWriteoffOfFinancingCosts is thus an internal detail of the calculation linkbase. The names of linkbase labels are in principle "free to choose". Conventions have established themselves (such as prefixing with loc_), but I would not rely on them. Rather, you can "follow the trail" by looking at the definition of the linkbase label:
<link:loc xlink:type="locator"
xlink:href="phil-20200630.xsd#PHIL_WriteoffOfFinancingCosts"
xlink:label="loc_PHILWriteoffOfFinancingCosts" />
Where you see, thanks to the xlink:href attribute, that this locator points to the concept with the ID PHIL_WriteoffOfFinancingCosts in file phil-20200630.xsd.
<element id="PHIL_WriteoffOfFinancingCosts"
name="WriteoffOfFinancingCosts" .../>
And you can see that the local name of this concept is WriteoffOfFinancingCosts. The concept is in the namespace commonly associated with the prefix PHIL:, but the prefix never appears in the concept definition, because all concepts in that file are in that same namespace. How do we know this? Because at the top of the xsd file it says targetNamespace="http://phiglobal.com/20200630", and the prefix PHIL: is bound to this namespace in the instance file phil-20200630.xml with xmlns:PHIL="http://phiglobal.com/20200630".
It is common practice to choose concept IDs with the prefix followed by underscore followed by the local name. Some users rely on it, but following the levels of indirection, in spite of being more complex, is "safer": linkbase label loc_PHILWriteoffOfFinancingCosts -> concept ID PHIL_WriteoffOfFinancingCosts -> concept local name WriteoffOfFinancingCosts -> concept's fully qualified name PHIL:WriteoffOfFinancingCosts.
You probably notice how complex this is. In fact, this is the reason why it is worth using an XBRL processor, which will do all of this for you.
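For readers who still want to follow the trail by hand, the chain of indirection can be sketched with the standard library alone. The two documents below are cut down to the fragments quoted above (the enclosing linkbase and schema elements are reconstructed for the example), and resolve_locators is a hypothetical helper that handles a single schema file only, which a real filing would not be limited to.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"
LINK = "http://www.xbrl.org/2003/linkbase"
XSD = "http://www.w3.org/2001/XMLSchema"

CALC = """<link:linkbase xmlns:link="http://www.xbrl.org/2003/linkbase"
               xmlns:xlink="http://www.w3.org/1999/xlink">
  <link:loc xlink:type="locator"
            xlink:href="phil-20200630.xsd#PHIL_WriteoffOfFinancingCosts"
            xlink:label="loc_PHILWriteoffOfFinancingCosts" />
</link:linkbase>"""

SCHEMA = """<schema xmlns="http://www.w3.org/2001/XMLSchema"
        targetNamespace="http://phiglobal.com/20200630">
  <element id="PHIL_WriteoffOfFinancingCosts" name="WriteoffOfFinancingCosts" />
</schema>"""

def resolve_locators(calc_xml, schema_xml):
    """Map each locator label to (namespace URI, local name) of its concept."""
    schema = ET.fromstring(schema_xml)
    target_ns = schema.get("targetNamespace")
    # Index the schema's element declarations by their id attribute.
    name_by_id = {
        el.get("id"): el.get("name")
        for el in schema.iter("{%s}element" % XSD)
        if el.get("id")
    }
    resolved = {}
    for loc in ET.fromstring(calc_xml).iter("{%s}loc" % LINK):
        href = loc.get("{%s}href" % XLINK)
        label = loc.get("{%s}label" % XLINK)
        fragment = href.partition("#")[2]  # the part after '#'
        if fragment in name_by_id:
            resolved[label] = (target_ns, name_by_id[fragment])
    return resolved

resolved = resolve_locators(CALC, SCHEMA)
print(resolved)
```

The result identifies the concept by namespace URI and local name, so it can be compared with an instance fact regardless of which prefix the instance document uses.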
@Ghislain Fourny: Many thanks. I'm glad to know that I wasn't crazy for finding the situation complex. Knowing now that the linkbase labels are "free-to-choose", here is the specific BeautifulSoup workaround that I've come up with, in case anyone is interested:
import requests
from bs4 import BeautifulSoup

# calcurl (URL of the calculation linkbase) and headers are defined elsewhere
labeldict = {}
resp = requests.get(calcurl, headers=headers)
ctext = resp.text
soup = BeautifulSoup(ctext, 'lxml')
tags = soup.find_all()
for tag in tags:
    if tag.name == 'link:loc':
        if tag.has_attr('xlink:href') and tag.has_attr('xlink:label'):
            href = tag['xlink:href']
            firstsplit = href.split('#')[1]  ## gets the part of the link after the pound symbol
            value = firstsplit.split('_', 1)[1]  ## gets everything after the first underscore
            key = tag['xlink:label']
            labeldict[key] = value
Which results in a dictionary where keys are the 'loc_Phil'-type label names and the values are the plain concept names, e.g. labeldict['loc_PHILWriteoffOfFinancingCosts'] = 'WriteoffOfFinancingCosts'
This assumes that xsd links will always follow a format of '...#..._concept'. I haven't found any that don't follow that format, but that's not a guarantee.

How to get all text inside XML tags

xml file snapshot
From the above .xml file I am extracting article-id, article-title, abstract and keywords. For plain text inside a single tag I get correct results, but not for text that contains nested tags, such as:
<title-group>
<article-title>
Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium,
<italic>Rapidithrix thailandica</italic>
</article-title>
</title-group>
.
.
same is for abstract...
I got output as:
OrderedDict([(u'italic', u'Rapidithrix thailandica'), ('#text', u'Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium,')])
The code has treated the nested tag as a separate key, and the output is not in document order.
How can I simply extract the text from such an input document as "Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium, Rapidithrix thailandica"?
I am using the Python code below to perform this task:
import xmltodict
import os
from os.path import basename
import re
with open('2630847.nxml') as fd:
    doc = xmltodict.parse(fd.read())
pmc_id = doc['article']['front']['article-meta']['article-id'][1]['#text']
article_title = doc['article']['front']['article-meta']['title-group']['article-title']
y = doc['article']['front']['article-meta']['abstract']
y = y.items()[0]
article_abstract = [g.encode('ascii','ignore') for g in y][1]
z = doc['article']['front']['article-meta']['kwd-group']['kwd']
zz = [g.encode('ascii','ignore') for g in z]
article_keywords = ",".join(zz).replace(","," ")
fout = open(str(pmc_id)+".txt","w")
fout.write(str(pmc_id)+"\n"+str(article_title)+". "+str(article_abstract)+". "+str(article_keywords))
Can somebody please suggest corrections..
xmltodict will likely be hard to use for your data. PMC journal articles are definitely not what the authors could have had in mind. Putting any but the most trivial XML into xmltodict is pounding a round peg into a square hole -- you might succeed, but it won't be pretty. I explain this further below under "tldr"....
Instead, I suggest you use a library whose data model fits your data better, such as xml.dom.minidom or a recent version of BeautifulSoup. In many such libraries you load the document with one call and then call some function like innerText() to get all of its text content. You could even load the document into a browser and read the JavaScript innerText property to get what you want. If the tool you choose doesn't already provide innertext(), it is easy to write:
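from xml.dom.minidom import Element, Text

def innertext(node):
    # Concatenate all text under node, in document order.
    t = ""
    for curNode in node.childNodes:
        if isinstance(curNode, Text):
            t += curNode.nodeValue
        elif isinstance(curNode, Element):
            t += innertext(curNode)  # recurse into child elements
    return t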
You could tweak that to put spaces between the text nodes, depending on your data.
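As a self-contained sketch of this approach, here is the title from the question run through xml.dom.minidom. The helper text_of is a compact variant written for this example, and the whitespace normalisation at the end is one possible tweak, not the only one.

```python
from xml.dom.minidom import parseString

SAMPLE = """<title-group>
<article-title>
Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium,
<italic>Rapidithrix thailandica</italic>
</article-title>
</title-group>"""

def text_of(node):
    """Return all text below node, in document order."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)

doc = parseString(SAMPLE)
title_node = doc.getElementsByTagName("article-title")[0]
# Collapse the line breaks and indentation left over from the markup.
title = " ".join(text_of(title_node).split())
print(title)
```

The italic element contributes its text at the position where it occurs, so the species name stays at the end of the title.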
Hope that helps.
==tldr==
xmltodict makes an admirable attempt at making XML "as simple as possible"; but IMHO it errs in making it simpler than possible.
xmltodict basically works by turning every element into a dict, with its children as the dict items, keyed by their element names. But in many cases (such as yours), XML data isn't very much like that at all. For example, an element can have many children with the same name, but a dict can't.
So xmltodict has to do something special. It turns adjacent instances of the same element type into an array (without the element type). Here's an example excerpted from https://github.com/martinblech/xmltodict:
<and>
  <many>elements</many>
  <many>more elements</many>
</and>
becomes:
"and": {
  "many": [
    "elements",
    "more elements"
  ]
},
First off, this means that xmltodict always loses the ordering information about child elements unless they are of the same type. So a section that contains a mix of paragraphs, lists, blockquotes, and so on, will either fail to load in xmltodict, or have all the scattered instances of each kind of child gathered together, completely losing their order.
The xmltodict approach also introduces frequent special-cases -- for example, you can't just get a list of all the children, or use len() to find out how many there are, etc. etc., because at every step you have to check whether you're really at a child element, or at a list of them.
Looking at xmltodict's own examples, you'll see that they mostly consist of walking down the tree by element names, but every now and then there's an integer subscript -- that's for the cases where these arrays are needed. But unless the data is unusually simple (which yours isn't), you won't know where that is. For example, if one DIV in an HTML document happens to contain only one P, the code to access the P needs one fewer subscript than with another DIV that happens to have more than one P.
It seems to me undesirable that the number of subscripts to get to something depends on how many siblings it has, and their types.
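To make the shape problem concrete without depending on xmltodict itself, here is a toy function that groups children by element name in the way described above. It is an assumption-laden imitation written for this example, not xmltodict's actual code, and it ignores attributes and nested children.

```python
import xml.etree.ElementTree as ET

def group_children(elem):
    """Toy imitation: key children by tag, turning repeats into a list."""
    grouped = {}
    for child in elem:
        if child.tag not in grouped:
            grouped[child.tag] = child.text
        else:
            if not isinstance(grouped[child.tag], list):
                grouped[child.tag] = [grouped[child.tag]]
            grouped[child.tag].append(child.text)
    return grouped

one_p = group_children(ET.fromstring("<div><p>a</p></div>"))
mixed = group_children(ET.fromstring("<div><p>a</p><ul>x</ul><p>b</p></div>"))

print(one_p)   # 'p' maps to a plain string
print(mixed)   # 'p' maps to a list, and the position of 'ul' is gone
```

With one p the value is a string; with two it becomes a list, so the access path changes. In the second case nothing records that the ul sat between the two paragraphs.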
Alas, the structure still isn't good enough. Since child elements may have their own child elements, just making them strings in that extra array won't be enough. Sometimes they'll have to be dicts again, with some of their items in turn perhaps being arrays, some of whose items may be dicts, and so on. Writing the correct traversal algorithm to gather up the text is significantly harder than the DOM one shown above.
To be completely fair, there is some XML in which the order doesn't matter logically -- for example, you could export a SQL table into an XML file, using a container element for each record with a child element for each field. The order of fields is not information, so if you load such XML into xmltodict, losing the order doesn't matter. Likewise if you serialized Python data that was already just a dict. But those are very specialized edge cases. xmltodict might be an excellent choice for a case like that -- but the articles you're looking at are very far from that.

Python XML parsing - equivalent of "grep -v" in bash

This is one of my first forays into Python. I'd normally stick with bash, but Minidom seems to suit my needs for XML parsing perfectly, so I'm giving it a shot.
First question which I can't seem to figure out is, what's the equivalent for 'grep -v' when parsing a file?
Each object I'm pulling begins with a specific tag. If, within said tag, I want to exclude a row of data based off of a certain string embedded within the tag, how do I accomplish this?
Pseudo code that I've got now (no exclusion):
mainTag = xml.getElementsByTagName("network_object")
name = network_object.getElementsByTagName("Name")[0].firstChild.data
I'd like to see the data output all "name" fields, with the exception of strings that contain "cluster". Since I'll be doing multiple searches on network_objects, I believe I need to do it at that level, but don't know how.
Etree is giving me a ton of problems, can you give me some logic to do this with minidom?
This obviously doesn't work:
name = network_object.getElementsByTagName("Name")[0].firstChild.data
if name is not 'cluster' in name
continue
First of all, step away from the minidom module. Minidom is great if you already know the DOM from other languages and really do not want to learn any other API. There are easier alternatives available, right there in the standard library. I'd use the ElementTree API instead.
You generally just loop over matches, and skip over the ones that you want to exclude as you do so:
from xml.etree import ElementTree

tree = ElementTree.parse(somefile)
for name in tree.findall('.//network_object//Name'):
    if name.text is not None and 'cluster' in name.text:
        continue  # skip this one
    print(name.text)  # handle the names that were not excluded
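Since the question asks for minidom specifically, the same skip-as-you-loop logic can be sketched there too. The sample XML and the object names in it are invented for the example; only the network_object and Name tags come from the question.

```python
from xml.dom.minidom import parseString

# Invented sample data; only the tag names come from the question.
SAMPLE = """<objects>
  <network_object><Name>web-01</Name></network_object>
  <network_object><Name>db-cluster-01</Name></network_object>
  <network_object><Name>mail-01</Name></network_object>
</objects>"""

doc = parseString(SAMPLE)
names = []
for network_object in doc.getElementsByTagName("network_object"):
    name_nodes = network_object.getElementsByTagName("Name")
    # Guard against objects with no Name element or an empty one.
    if not name_nodes or name_nodes[0].firstChild is None:
        continue
    name = name_nodes[0].firstChild.data
    if "cluster" in name:
        continue  # the "grep -v" step: drop names containing "cluster"
    names.append(name)

print(names)
```

Doing the check inside the loop over network_object keeps the rest of each object available, so other fields of the kept objects can be read in the same pass.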
