I am using pdfquery to extract data from a PDF.
My PDF tree XML looks like the following:
<LTTextLineHorizontal>
<LTTextBoxHorizontal>Address</LTTextBoxHorizontal>
</LTTextLineHorizontal>
<LTTextBoxHorizontal>
<LTTextBoxHorizontal>
First-Name
</LTTextBoxHorizontal>
<LTTextBoxHorizontal>
Last-Name
</LTTextBoxHorizontal>
</LTTextBoxHorizontal>
The idea is to build the string "Address First-Name Last-Name".
So I need to select child elements depending on whether they exist, and I am at a loss as to how to do it.
You'll want to use .extract with the with_parent keyword in order to restrict the search to the children of a given element. The documentation gives a decent example:
pdf.extract([
    ('with_parent', 'LTPage[page_index="1"]'),
    ('last_name', ':in_bbox("315,680,395,700")')
])
In this case, you're simply limiting the search to page 1 of the document. However, you can also pass in the result of previous selections with the with_parent keyword.
For example, if Address in your example had children (street, city, zipcode), you could first find the address section, store the element in a variable, and then use .extract to pull out the children. How you store and structure the resulting data will depend on your ultimate needs.
address = pdf.pq('LTTextBoxHorizontal:contains("Address")')
pdf.extract([
    ('with_parent', address),
    (..., ...)
])
In many cases, the children are not actually nested within the XML tree, and you need to resort to a bbox-based approach. What I do in that case is construct a bbox using the "parent" as the top boundary and the next known non-child as the bottom boundary, and then pass that in to .extract. Just remember that bboxes are given as the bottom-left and top-right coordinates.
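The bbox construction described above can be sketched as a small helper. This is only an illustration: the name bbox_between and the sample coordinates are made up, and it assumes each bbox is an (x0, y0, x1, y1) tuple with the origin at the bottom-left of the page.

```python
def bbox_between(parent_bbox, next_bbox, left, right):
    """Build an :in_bbox selector for the area below the parent and above the next element.

    Hypothetical helper: each bbox is (x0, y0, x1, y1), origin at the bottom-left.
    """
    parent_bottom = parent_bbox[1]  # y0 of the "parent" is the top of the search area
    next_top = next_bbox[3]         # y1 of the next non-child is the bottom of the search area
    # in_bbox takes bottom-left x,y followed by top-right x,y
    return ':in_bbox("{},{},{},{}")'.format(left, next_top, right, parent_bottom)


# Made-up coordinates for an "Address" label and the next known section.
address_bbox = (50, 700, 200, 715)
next_section_bbox = (50, 600, 200, 615)
selector = bbox_between(address_bbox, next_section_bbox, left=50, right=400)
print(selector)
```

The resulting string can then be used as the selector in an .extract entry, in the same way as the in_bbox selector in the documentation example.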
I have what is, probably, a very stupid question, but I'm stumped by it and would appreciate any help.
I'm trying to gather xbrl data from SEC filings using Python and BeautifulSoup. One problem I'm having is that certain line items are referred to differently in the instance document and the calculation linkbase.
As a concrete example, take this recent 10-K from PHI Group Inc.:
https://www.sec.gov/Archives/edgar/data/704172/000149315221015100/0001493152-21-015100-index.htm
A line item with the xbrl tag 'WriteoffOfFinancingCosts' shows up as
<PHIL:WriteoffOfFinancingCosts ...> in the instance document (along with a value and contexts)
but shows up as 'loc_PHILWriteoffOfFinancingCosts' in the calculation linkbase.
But this relationship, 'PHIL:' = 'loc_PHIL', isn't standard across XBRL filings. How does one know what prefix will be added to a tag in the calculation linkbase so that (with the prefix removed) it can be reliably tied back to a tag in the instance document?
I can think of various workarounds, but it just seems silly; isn't there somewhere I can look in the calculation linkbase or elsewhere that will just TELL me exactly what prefix is added?
As some (possibly relevant) nuance: lots of tags in lots of filings, of course, have a prefix like 'us-gaap', indicating the us-gaap namespace, but that doesn't seem to guarantee that a tag in the calculation linkbase will therefore look like 'us-gaapAccountsPayableCurrent' and not 'loc_us-gaapAccountsPayableCurrent' or 'us-gaap:AccountsPayableCurrent' or some other variation of the basic pattern, all of which, of course, look different to BeautifulSoup.
Can anyone point me in the right direction?
PHIL:WriteoffOfFinancingCosts is the name of the XBRL concept, while loc_PHILWriteoffOfFinancingCosts is the (calculation linkbase) label of the locator pointing to the concept PHIL:WriteoffOfFinancingCosts. This mechanism is the way linkbases connect concepts together: each locator is a "proxy" to a concept.
loc_PHILWriteoffOfFinancingCosts is thus an internal detail of the calculation linkbase. The names of linkbase labels are in principle "free to choose". Conventions have established themselves (such as prefixing with loc_), but I would not rely on them. Rather, you can "follow the trail" by looking at the definition of the linkbase label:
<link:loc xlink:type="locator"
xlink:href="phil-20200630.xsd#PHIL_WriteoffOfFinancingCosts"
xlink:label="loc_PHILWriteoffOfFinancingCosts" />
Here you can see, thanks to the xlink:href attribute, that this locator points to the concept with the ID PHIL_WriteoffOfFinancingCosts in the file phil-20200630.xsd:
<element id="PHIL_WriteoffOfFinancingCosts"
name="WriteoffOfFinancingCosts" .../>
And you can see that the local name of this concept is WriteoffOfFinancingCosts. The prefix PHIL: never appears in the concept definition, because all concepts defined in that file are in the same namespace, the one commonly associated with PHIL:. How do we know this? Because at the top of the xsd file it says targetNamespace="http://phiglobal.com/20200630", and the prefix PHIL: is bound to this namespace in the instance file phil-20200630.xml with xmlns:PHIL="http://phiglobal.com/20200630".
It is common practice to choose concept IDs with the prefix followed by underscore followed by the local name. Some users rely on it, but following the levels of indirection, in spite of being more complex, is "safer": linkbase label loc_PHILWriteoffOfFinancingCosts -> concept ID PHIL_WriteoffOfFinancingCosts -> concept local name WriteoffOfFinancingCosts -> concept's fully qualified name PHIL:WriteoffOfFinancingCosts.
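As a rough sketch of that trail in Python, using only the standard library: the function name resolve_locators and the schemas mapping are assumptions for illustration, and the inline XML is cut down from the snippets quoted in this answer. Real filings also point at remote schemas (for example the us-gaap taxonomy), which this sketch does not fetch.

```python
import xml.etree.ElementTree as ET

LINK = "http://www.xbrl.org/2003/linkbase"
XLINK = "http://www.w3.org/1999/xlink"
XSD = "http://www.w3.org/2001/XMLSchema"

# Cut-down stand-ins for the calculation linkbase and the company schema.
CAL = """<link:linkbase xmlns:link="http://www.xbrl.org/2003/linkbase"
                        xmlns:xlink="http://www.w3.org/1999/xlink">
  <link:loc xlink:type="locator"
            xlink:href="phil-20200630.xsd#PHIL_WriteoffOfFinancingCosts"
            xlink:label="loc_PHILWriteoffOfFinancingCosts" />
</link:linkbase>"""

SCHEMA = """<schema xmlns="http://www.w3.org/2001/XMLSchema"
                    targetNamespace="http://phiglobal.com/20200630">
  <element id="PHIL_WriteoffOfFinancingCosts" name="WriteoffOfFinancingCosts"/>
</schema>"""


def resolve_locators(cal_xml, schemas):
    """Map each locator label to (namespace, local name).

    schemas is a hypothetical dict: file part of the href -> schema text.
    """
    resolved = {}
    for loc in ET.fromstring(cal_xml).iter("{%s}loc" % LINK):
        # href looks like "file.xsd#ConceptID"
        filename, _, concept_id = loc.get("{%s}href" % XLINK).partition("#")
        schema = ET.fromstring(schemas[filename])
        for element in schema.iter("{%s}element" % XSD):
            if element.get("id") == concept_id:
                label = loc.get("{%s}label" % XLINK)
                resolved[label] = (schema.get("targetNamespace"), element.get("name"))
    return resolved


concepts = resolve_locators(CAL, {"phil-20200630.xsd": SCHEMA})
print(concepts)
```

The namespace URI in the result can then be matched against the xmlns declarations of the instance document to recover the prefix used there.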
You probably notice how complex this is. In fact, this is the reason why it is worth using an XBRL processor, which will do all of this for you.
@Ghislain Fourny: Many thanks. I'm glad to know that I wasn't crazy for finding the situation complex. Knowing now that the linkbase labels are "free to choose", here is the specific BeautifulSoup workaround that I've come up with, in case anyone is interested:
import requests
from bs4 import BeautifulSoup

labeldict = {}
# calcurl is the URL of the calculation linkbase; headers are the request headers
resp = requests.get(calcurl, headers=headers)
ctext = resp.text
soup = BeautifulSoup(ctext, 'lxml')
tags = soup.find_all()
for tag in tags:
    if tag.name == 'link:loc':
        if tag.has_attr('xlink:href') and tag.has_attr('xlink:label'):
            href = tag['xlink:href']
            firstsplit = href.split('#')[1]  # the part of the link after the pound symbol
            value = firstsplit.split('_', 1)[1]  # the part after the first underscore
            key = tag['xlink:label']
            labeldict[key] = value
Which results in a dictionary where keys are the 'loc_Phil'-type label names and the values are the plain concept names, e.g. labeldict['loc_PHILWriteoffOfFinancingCosts'] = 'WriteoffOfFinancingCosts'
This assumes that xsd links will always follow a format of '...#..._concept'. I haven't found any that don't follow that format, but that's not a guarantee.
First I add documents to the index like this:
writer.add_document(title=doc_path.split(os.sep)[-1], path=doc_path, content=text, textdata=text)
Then I need to delete one of them completely from the index by its path. The documentation lists a couple of methods for this:
delete_by_term(fieldname, termtext)
Deletes any documents where the given (indexed) field contains the
given term. This is mostly useful for ID or KEYWORD fields.
delete_by_query(query)
Deletes any documents that match the given query.
but I can't find a convenient method where I can specify the path of the document and just remove it. There is also a low-level method that takes an internal doc_number, which I would have to obtain somehow.
Can anyone advise me on the best way to accomplish this?
from whoosh.index import open_dir

ix = open_dir('/my_index_dir_path/..')
writer = ix.writer()
# first argument is the field name, second is the value to match
writer.delete_by_term('path', doc_path)
writer.commit()
The delete_by_term method does exactly what I need. Note that the first argument is the field name, the string 'path', and the second is the actual path. My mistake was passing the actual path where the field name belongs.
I am using Python to pick a specific set of values from my XML:
children = list(root[2])  # getchildren() was removed in Python 3.9
for child in children:
    ET.dump(child)
Once I use this, I get a print of exactly what I need from my XML. I can also change the root index to access different data. I want to export this selection as a new, separate XML file. However, when I use:
tree.write('new.xml')
it exports the entire XML, as before, not just the part I selected above.
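One possible approach, sketched here under the assumption that the selection is the subtree at root[2], is to wrap that element in its own ElementTree and write that instead of the original tree. The sample document below is made up, and an in-memory buffer stands in for 'new.xml' so the snippet has no side effects.

```python
import io
import xml.etree.ElementTree as ET

# Made-up stand-in for the real document.
root = ET.fromstring("<root><a/><b/><c><item>1</item><item>2</item></c></root>")

# Wrap only the selected element; tree.write() on the original tree always
# serializes from the original root.
subtree = ET.ElementTree(root[2])

buffer = io.BytesIO()  # with a real file this would be subtree.write('new.xml')
subtree.write(buffer)
print(buffer.getvalue())
```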
I have to verify the text of drop-down list elements. How can I do this with a Python script in the Squish tool?
Naive approach:
Record (then replay) selecting each of the entries. Use exception handling to log failures when accessing individual entries, so that the test script can keep running.
More flexible approach:
Record selecting one of the entries. This gives you the script code that opens the drop-down and the object name of the drop-down list. Then use object.children() to get all child elements of the drop-down list object.
Pseudo example:
drop_down_list = waitForObject(...)
children = object.children(drop_down_list)
# test.compare checks two values for equality; test.verify expects a boolean condition
test.compare(children[0].text, "Entry 1")
(You have to check the properties of the children to see which actual property contains the text or whatever else you want to verify.)
xml file snapshot
From the above .xml file I am extracting the article-id, article-title, abstract and keywords. For plain text inside a single tag I get correct results, but not for text that contains nested tags, such as:
<title-group>
<article-title>
Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium,
<italic>Rapidithrix thailandica</italic>
</article-title>
</title-group>
...
The same applies to the abstract.
I got this output:
OrderedDict([(u'italic', u'Rapidithrix thailandica'), ('#text', u'Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium,')])
The code has treated the tag as a separate entry, and the output is not in document order either.
How can I simply extract the text from such an input document as "Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium, Rapidithrix thailandica"?
I am using the Python code below to perform this task:
import xmltodict
import os
from os.path import basename
import re
with open('2630847.nxml') as fd:
    doc = xmltodict.parse(fd.read())
pmc_id = doc['article']['front']['article-meta']['article-id'][1]['#text']
article_title = doc['article']['front']['article-meta']['title-group']['article-title']
y = doc['article']['front']['article-meta']['abstract']
y = y.items()[0]
article_abstract = [g.encode('ascii','ignore') for g in y][1]
z = doc['article']['front']['article-meta']['kwd-group']['kwd']
zz = [g.encode('ascii','ignore') for g in z]
article_keywords = ",".join(zz).replace(","," ")
fout = open(str(pmc_id)+".txt","w")
fout.write(str(pmc_id)+"\n"+str(article_title)+". "+str(article_abstract)+". "+str(article_keywords))
Can somebody please suggest corrections?
xmltodict will likely be hard to use for your data. PMC journal articles are definitely not what the authors could have had in mind. Putting any but the most trivial XML into xmltodict is pounding a round peg into a square hole -- you might succeed, but it won't be pretty. I explain this further below under "tldr"....
Instead, I suggest you use a library whose data model fits your data better, such as xml.dom.minidom or a recent version of BeautifulSoup. In many such libraries you just load the document with one call and then call some function like innerText() to get all of its text content. You could even just load the document into a browser and read the JavaScript innerText property to get what you want. If the tool you choose doesn't provide innertext() already, it is easy to write:
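from xml.dom.minidom import Element, Text

def innertext(node):
    """Concatenate all text beneath node, in document order."""
    t = ""
    for curNode in node.childNodes:
        if isinstance(curNode, Text):
            t += curNode.nodeValue
        elif isinstance(curNode, Element):
            t += innertext(curNode)  # recurse into child elements
    return t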
You could tweak that to put spaces between the text nodes, depending on your data.
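As a rough illustration with xml.dom.minidom, here is the same idea applied to the article title from the question. The inline XML is a trimmed copy of that snippet, and the whitespace normalization at the end is an added assumption about what the output should look like.

```python
from xml.dom.minidom import Element, Text, parseString

SAMPLE = """<title-group>
<article-title>
Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium,
<italic>Rapidithrix thailandica</italic>
</article-title>
</title-group>"""


def innertext(node):
    """Concatenate all text beneath node, in document order."""
    parts = []
    for child in node.childNodes:
        if isinstance(child, Text):
            parts.append(child.nodeValue)
        elif isinstance(child, Element):
            parts.append(innertext(child))  # recurse into nested tags such as <italic>
    return "".join(parts)


dom = parseString(SAMPLE)
title_node = dom.getElementsByTagName("article-title")[0]
# Collapse the line breaks and indentation of the source file into single spaces.
title = " ".join(innertext(title_node).split())
print(title)
```

Because the recursion visits children in order, the italic species name ends up where it belongs, at the end of the title.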
Hope that helps.
==tldr==
xmltodict makes an admirable attempt at making XML "as simple as possible"; but IMHO it errs in making it simpler than possible.
xmltodict basically works by turning every element into a dict, with its children as the dict items, keyed by their element names. But in many cases (such as yours), XML data isn't very much like that at all. For example, an element can have many children with the same name, but a dict can't.
So xmltodict has to do something special. It turns adjacent instances of the same element type into an array (without the element type). Here's an example excerpted from https://github.com/martinblech/xmltodict:
<and>
<many>elements</many>
<many>more elements</many>
</and>
becomes:
"and": {
"many": [
"elements",
"more elements"
]
},
First off, this means that xmltodict always loses the ordering information about child elements unless they are of the same type. So a section that contains a mix of paragraphs, lists, blockquotes, and so on, will either fail to load in xmltodict, or have all the scattered instances of each kind of child gathered together, completely losing their order.
The xmltodict approach also introduces frequent special-cases -- for example, you can't just get a list of all the children, or use len() to find out how many there are, etc. etc., because at every step you have to check whether you're really at a child element, or at a list of them.
Looking at xmltodict's own examples, you'll see that they mostly consist of walking down the tree by element names, but every now and then there's an integer subscript -- that's for the cases where these arrays are needed. But unless the data is unusually simple (which yours isn't), you won't know where that is. For example, if one DIV in an HTML document happens to contain only one P, the code to access the P needs one fewer subscript than with another DIV that happens to have more than one P.
It seems to me undesirable that the number of subscripts to get to something depends on how many siblings it has, and their types.
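To make the problem concrete without depending on xmltodict itself, here is a deliberately simplified stand-in that groups children by element name in the same spirit. The function naive_to_dict is hypothetical and is not xmltodict's actual implementation; it only illustrates the ordering and subscript issues described above.

```python
import xml.etree.ElementTree as ET


def naive_to_dict(element):
    """Group child elements by tag name (simplified stand-in, not xmltodict)."""
    children = list(element)
    if not children:
        return element.text
    result = {}
    for child in children:
        value = naive_to_dict(child)
        if child.tag in result:
            # A second child with the same name turns the entry into a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


section = ET.fromstring("<sec><p>one</p><list>two</list><p>three</p></sec>")
converted = naive_to_dict(section)
print(converted)
```

In the converted data the two p elements are gathered into one list, so the fact that the list element sat between them is gone, and reaching a p needs a subscript while reaching the list does not.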
Alas, the structure still isn't good enough. Since child elements may have their own child elements, just making them strings in that extra array won't be enough. Sometimes they'll have to be dicts again, with some of their items in turn perhaps being arrays, some of whose items may be dicts, and so on. Writing the correct traversal algorithm to gather up the text is significantly harder than the DOM one shown above.
To be completely fair, there is some XML in which the order doesn't matter logically -- for example, you could export a SQL table into an XML file, using a container element for each record with a child element for each field. The order of fields is not information, so if you load such XML into xmltodict, losing the order doesn't matter. Likewise if you serialized Python data that was already just a dict. But those are very specialized edge cases. xmltodict might be an excellent choice for a case like that -- but the articles you're looking at are very far from that.