I am trying to capture all the claims text in a bunch of XML patent files, but I'm having trouble with tags nested inside <claim-text>. Sometimes there is another <claim-text> inside, and sometimes a <claim-ref> interrupts the text. In my output, the text gets cut off. Usually there are over 10 claims. I am trying to get only the text inside the claim text.
I've already looked at and tried the following, but these don't work:
xml elementree missing elements python and
How to get all sub-elements of an element tree with Python ElementTree?
I've included a snippet here as it does get quite long to capture all.
My code for this is below (where fullname is the file name and directory).
for _, elem in iterparse(fullname):
    description = ''  # reset to empty string at beginning of each loop
    abtext = ''       # reset to empty string at beginning of each loop
    claimtext = ''    # reset to empty string
    if elem.tag == 'claims':
        for node4 in tree.findall('.//claims/claim/claim-text'):
            claimtext = claimtext + node4.text
        f.write('\n\nCLAIMTEXT\n\n\n')
        f.write(smart_str(claimtext) + '\n\n')
        # put row in df
        row = dict(zip(['PATENT_ID', 'CLASS', 'ABSTRACT', 'DESCRIPTION', 'CLAIMS'],
                       [data, cat, abtext, description, claimtext]))
        row_s = pd.Series(row)
        row_s.name = i
        df = df.append(row_s)
So the resulting problem is twofold: a) only one of the texts gets printed to file, and b) nothing comes into the dataframe at all. I'm not sure if that's part of the same problem or two separate problems. I can get the claims to print into a file, and that works fine, but it skips some of the text.
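For what it's worth, the cut-off text is exactly what `Element.text` gives you when a child tag interrupts: it stops at the first nested element. A minimal sketch (a toy claim, not the real patent schema) showing how `itertext()` gathers every nested piece instead:

```python
import xml.etree.ElementTree as ET

# Toy claim with a nested <claim-ref> and an inner <claim-text>
xml = ('<claims><claim><claim-text>A device as in '
       '<claim-ref idref="CLM-1">claim 1</claim-ref>, wherein '
       '<claim-text>the widget is blue.</claim-text>'
       '</claim-text></claim></claims>')

root = ET.fromstring(xml)
node = root.find('.//claim-text')

partial = node.text                # stops at the first child tag
full = ''.join(node.itertext())    # walks all nested elements in order
```

Here `partial` is just `'A device as in '`, while `full` recovers the whole claim sentence including the text inside the nested tags.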
Related
I am trying to build a script that will extract specific parts (namely the link and its related description) out of an HTML file and return the result per line.
I'm trying to build it using lists in Python, yet I'm making a mistake somehow!
This is what I've done so far, but it returns my values list blank:
import re

def subtext(data, first_link, last_link, first_descr, last_descr):
    values = []
    link = re.search('''"first_link"(.+?)"last_link"''', data)
    values.append(link)
    descr = re.search('''"first_descr"(.+?)"last_descr"''', data)
    values.append(descr)
    while values:
        print(values)

html_file = input("Type filepath: ")
html_code = open(html_file, "r")
html_data = html_code.read()
subtext(html_data, '''11px;">''', '''</td><td style="font-''')
html_code.close()
There is an HTML parser for Python. But if you want to use your own code, then you need to fix these mistakes:
link = re.search('''"first_link"(.+?)"last_link"''', data)
values.append(link)
First of all, your regex will search for the literal strings "first_link" and "last_link" instead of the values from the function args. Use .format to build the pattern string from the args.
Also, in the above code link will be a re.Match object, not a string. Use group() to pick the string out of the object, and make sure it actually found something first. Same story with the next re.search.
while values:
print(values)
Here you will get into an infinite loop of prints. Simply do print(values) without any loop.
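Putting those fixes together, a corrected sketch of the function might look like this (the markers in the usage line below are toy values, made up for illustration):

```python
import re

def subtext(data, first_link, last_link, first_descr, last_descr):
    values = []
    # Build the pattern from the arguments (via .format) instead of the
    # literal strings; re.escape guards against regex metacharacters
    # appearing in the markers.
    link = re.search('{}(.+?){}'.format(re.escape(first_link),
                                        re.escape(last_link)), data)
    if link:                          # re.search returns None on no match
        values.append(link.group(1))  # group(1) is the captured string
    descr = re.search('{}(.+?){}'.format(re.escape(first_descr),
                                         re.escape(last_descr)), data)
    if descr:
        values.append(descr.group(1))
    print(values)                     # print once, no loop
    return values

# Toy usage: extract 'hello' between [ and ], 'world' between ( and )
result = subtext('x[hello]y (world)z', '[', ']', '(', ')')
```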
I have a loop which scans a website for a particular element and then scrapes it and places it within a list and then this gets put into a string variable.
Postalcode3 outputs fine to the DF, and this in turn outputs correctly to the CSV; however, postalcode4 does not output anything, and those cells are simply skipped in the CSV.
Here is the loop function -
for i in range(30):
    page = requests.get('https://www.example.com' + df.loc[i, 'ga:pagePath'])
    tree = html.fromstring(page.content)
    postalcode2 = tree.xpath('//span[@itemprop="postalCode"]/text()')
    postalcode = tree.xpath('//span[@itemprop="addressRegion"]/text()')
    if not postalcode2 and not postalcode:
        print(postalcode, postalcode2)
    elif not postalcode2:
        postalcode4 = postalcode[0]
        # postalcode4 = postalcode4.replace(' ', '')
        df.loc[i, 'postcode'] = postalcode4
    elif not postalcode:
        postalcode3 = postalcode2[0]
        if 'Â' not in postalcode3:
            postalcode3 = postalcode3.replace('\\xa0', '')
            postalcode3 = postalcode3.replace(' ', '')
        else:
            postalcode3 = postalcode3.replace('\\xa0Â', '')
            postalcode3 = postalcode3.replace(' ', '')
        df.loc[i, 'postcode'] = postalcode3
I have debugged it and can see that the string output by postalcode4 is correct and in the same format as postalcode3.
Postalcode3 has a load of character removal elements placed in as that particular web element comes full of useless characters.
I'm not entirely sure what's gone wrong.
This is how I read in the DF and insert the new column which will be written into by the loop function.
files = 'example.csv'
df = pandas.read_csv(files, index_col=0)
df.insert(5,'postcode','')
It's possible you aren't handling the web output correctly.
The content attribute of a requests.get response is a bytestring, but HTML content is text. If you don't decode the bytestring before you create the HTML then you may well find extraneous characters due to the encoding appear in your text. The correct way to handle these is not, however, to continue with a bytestring, but instead to convert the incoming bytestring to text by decoding it before calling html.fromstring.
You should really find the correct encoding from the charset parameter of the Content-Type header, if it's present. As an experiment you might try replacing
tree = html.fromstring(page.content)
with
tree = html.fromstring(page.content.decode('utf-8'))
since many web sites will use UTF8 encoding. You may find that the responses then appear to make more sense, and that you don't need to strip so much "extraneous" stuff out.
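A small self-contained sketch of why the stray 'Â' appears in the first place (the byte string below is a hypothetical postcode, standing in for the real page content): in UTF-8, a non-breaking space is the two-byte sequence b'\xc2\xa0'. Decode those bytes with the wrong codec (latin-1) and the pair turns into 'Â' plus NBSP; decode them as UTF-8 and you get a single NBSP that a simple replace removes.

```python
raw = b'AB1\xc2\xa02CD'          # hypothetical postcode bytes from the page

wrong = raw.decode('latin-1')    # 'AB1Â\xa02CD' -- the mystery 'Â' appears
right = raw.decode('utf-8')      # 'AB1\xa02CD'  -- just a non-breaking space

clean = right.replace('\xa0', '')  # one replace, no 'Â' handling needed
```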
I have roughly 6000~6500 Microsoft Word .docx files with various types of formatted answer scripts inside them, in the sequence:
Python Programming Question in Bold
Answer in form of complete, correctly-indented, single-spaced, self-sufficient code
Unfortunately, there seems to be no fixed pattern delineating the code blocks from normal text. Some examples from the first 50 or so files:
Entire Question in bold, after which code starts abruptly, in
bold/italics
Question put in comments, after which code continues
Question completely missing, just code with numbered lists indicating start
Question completely missing, with a C/Python style comments indicating start
etc.
For now, I'm extracting the entire unformatted text through python-docx like this:
doc = Document(infil)
# For Unicode handling.
new_paragraphs = []
for paragraph in doc.paragraphs:
    new_paragraphs.append((paragraph.text).encode("utf-8"))
new_paragraphs = list(map(lambda x: convert(x), new_paragraphs))
with open(outfil, 'w', encoding='utf-8') as f:
    print('\n'.join(new_paragraphs), file=f)
Once extracted, I'll run them using the PyPy Sandboxing feature which I understand is safe and then assign points as if in a contest.
What I'm completely stuck on is how to detect the start and end of the code programmatically. Most of the language detection APIs are unneeded since I already know the language. This Question: How to detect source code in a text? suggests using linters and syntax highlighters like the Google Code Prettifier, but they don't solve the issue of detecting separate programs.
A suitable solution, from this programmers.se question, seems to be training markov chains, but I wanted some second opinions before embarking on such a vast project.
This extraction code will also be provided to all students after evaluation.
I apologize if the question is too broad or the answer too obvious.
Hummm, so you are looking for some kind of formatting pattern? That sounds tricky to me. Is there any kind of text or string pattern that you can exploit? I'm not sure if this will help or not, but the VBA script below searches through all Word documents in a folder and marks any field that matches a search criterion you specify in Row 1 (the code writes "Yes" in the matching cell). It also puts a hyperlink in ColA, so you can click the link and open the file rather than searching around for it.
Script:
Sub OpenAndReadWordDoc()
    Rows("2:1000000").Select
    Range(Selection, Selection.End(xlDown)).Select
    Selection.ClearContents
    Range("A1").Select
    ' assumes that the previous procedure has been executed
    Dim oWordApp As Word.Application
    Dim oWordDoc As Word.Document
    Dim blnStart As Boolean
    Dim r As Long
    Dim sFolder As String
    Dim strFilePattern As String
    Dim strFileName As String
    Dim sFileName As String
    Dim ws As Worksheet
    Dim c As Long
    Dim n As Long

    '~~> Establish a Word application object
    On Error Resume Next
    Set oWordApp = GetObject(, "Word.Application")
    If Err() Then
        Set oWordApp = CreateObject("Word.Application")
        ' We started Word for this macro
        blnStart = True
    End If
    On Error GoTo ErrHandler

    Set ws = ActiveSheet
    r = 1 ' start row for the copied text from the Word document
    ' Last column
    n = ws.Range("A1").End(xlToRight).Column
    sFolder = "C:\Users\your_path_here\"
    '~~> This is the extension you want to go in for
    strFilePattern = "*.doc*"
    '~~> Loop through the folder to get the word files
    strFileName = Dir(sFolder & strFilePattern)
    Do Until strFileName = ""
        sFileName = sFolder & strFileName
        '~~> Open the word doc
        Set oWordDoc = oWordApp.Documents.Open(sFileName)
        ' Increase row number
        r = r + 1
        ' Enter file name in column A
        ws.Cells(r, 1).Value = sFileName
        ActiveCell.Offset(1, 0).Select
        ActiveSheet.Hyperlinks.Add Anchor:=Sheets("Sheet1").Range("A" & r), Address:=sFileName, _
            SubAddress:="A" & r, TextToDisplay:=sFileName
        ' Loop through the columns
        For c = 2 To n
            If oWordDoc.Content.Find.Execute(FindText:=Trim(ws.Cells(1, c).Value), _
                MatchWholeWord:=True, MatchCase:=False) Then
                ' If text found, enter Yes in column number c
                ws.Cells(r, c).Value = "Yes"
            End If
        Next c
        oWordDoc.Close SaveChanges:=False
        '~~> Find next file
        strFileName = Dir()
    Loop

ExitHandler:
    On Error Resume Next
    ' close the Word application
    Set oWordDoc = Nothing
    If blnStart Then
        ' We started Word, so we close it
        oWordApp.Quit
    End If
    Set oWordApp = Nothing
    Exit Sub

ErrHandler:
    MsgBox Err.Description, vbExclamation
    Resume ExitHandler
End Sub

Function GetDirectory(path)
    GetDirectory = Left(path, InStrRev(path, "\"))
End Function
Hi, and thanks for reading. I'll admit that this is a progression from a previous question I asked earlier, after I partially solved the issue. I am trying to process a block of text (file_object) in an earlier working function. The text, or file_object, happens to be in Unicode, but I have managed to convert it to ASCII text and split it on a line-by-line basis. I am hoping to then further split the text on the '=' symbol so that I can drop the text into a dictionary, for example a Key: Value pair like 'GPS Time': '14:18:43', so removing the trailing '.000' from the time (though this is a second issue).
Here’s the file_object format…
2015 Jan 01 20:07:16.047 GPS Info #Log packet ID
GPS Time = 14:18:43.000
Longitude = 000.65341
Latitude = +41.25385
Altitude = +111.400
This is my partially working function…
def process_data(file_object):
    file_object = file_object.encode('ascii', 'ignore')
    split = file_object.split('\n')
    for i in range(len(split)):
        while '=' in split[i]:
            processed_data = (split[i].split('=', 1) for _ in xrange(len(split)))
            return {k.strip(): v.strip() for k, v in processed_data}
This is the initial section of the main script that prompts the above function, and then sets GPS Time as the Dictionary key…
while (mypkt.Next()):   # mypkt.Next is an API function in the log processor
                        # app I am using - it grabs the whole GPS Info packet shown above
    data = process_data(mypkt.Text, 1)
    packets[data['GPS Time']] = data
The code above has no problem splitting the first instance, 'GPS Time', but it ignores Longitude, Latitude, etc. To make matters worse, there is sometimes a blank line between each packet item too. I guess I need to store the previous dictionary-related splits before the 'return', but I am having difficulty finding out how to do this.
The dict output I am currently getting is…
'14:19:09.000': {'GPS Time': '14:19:09.000'},
But What I am hoping for is…
'14:19:09': {'GPS Time': '14:19:09',
             'Longitude': '000.65341',
             'Latitude': '+41.25385',
             'Altitude': '+111.400'},
Thanks in advance for any help.
MikG
All this use of range(len(whatever)) is nonsense. You almost never need to do that in Python. Just iterate through the thing.
Your problem however is more fundamental: you return from inside the while loop. That means you only ever get one element, because as soon as that first line is processed, you return and the function ends.
Also, you have a while loop, which means that processing stops as soon as the program encounters a line without an equals sign; and since there are blank lines between packet items, execution would never proceed past the first one anyway.
So all you need is:
split_data = file_object.split('\n')
result = {}
for line in split_data:
    if '=' in line:
        key, value = line.split('=', 1)
        result[key.strip()] = value.strip()
return result
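Wrapped in a function and fed a sample packet like the one in the question, the approach above runs end to end (the '.000' trimming the questioner mentions as a second issue is sketched here too, as a simple replace on the parsed value):

```python
def process_data(text):
    # Split on lines; keep only lines containing '=', splitting each once
    # so values containing '=' are not broken up. Blank lines fall through.
    result = {}
    for line in text.split('\n'):
        if '=' in line:
            key, value = line.split('=', 1)
            result[key.strip()] = value.strip()
    return result

packet = """2015 Jan 01 20:07:16.047 GPS Info

GPS Time = 14:18:43.000
Longitude = 000.65341

Latitude = +41.25385
Altitude = +111.400"""

data = process_data(packet)
data['GPS Time'] = data['GPS Time'].replace('.000', '')  # trim trailing .000
```

The header line has no '=' so it is skipped automatically, as are the blank lines between items.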
I am trying to learn to use Whoosh. I have a large collection of HTML documents I want to search. I discovered that the text_content() method creates some interesting problems. For example, I might have some text that is organized in a table, like
<html><table><tr><td>banana</td><td>republic</td></tr><tr><td>stateless</td><td>person</td></table></html>
When I take the original string, get the tree, and then use text_content to get the text in the following manner
mytree = html.fromstring(myString)
text = mytree.text_content()
The results have no spaces (as should be expected)
'bananarepublicstatelessperson'
I tried to insert new lines using string.replace()
myString = myString.replace('</tr>','</tr>\n')
I confirmed that the new line was present
'<html><table><tr><td>banana</td><td>republic</td></tr>\n<tr><td>stateless</td><td>person</td></table></html>'
but when I run the same code from above the line feeds are not present. Thus the resulting text_content() looks just like above.
This is a problem for me because I need to be able to separate words. I thought I could add non-breaking spaces after each td, and line breaks after rows as well as after body elements etc., to get text that reasonably conforms to my original source.
I will note that I did some more testing and found that line breaks inserted after paragraph tag closes were preserved. But there is a lot of text in the tables that I need to be able to search.
Thanks for any assistance
You could use this solution:
import re
def striphtml(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

>>> striphtml('I Want This <b>text!</b>')
'I Want This text!'
Found here: using python, Remove HTML tags/formatting from a string
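Note that substituting an empty string reproduces the run-together problem from the question; substituting a space instead keeps the words separated. A sketch of that tweak (the sep parameter is my addition, not part of the linked answer), run against the table from the question:

```python
import re

def striphtml(data, sep=''):
    # Replace every tag with `sep`; passing ' ' keeps words apart.
    return re.compile(r'<.*?>').sub(sep, data)

table = ('<html><table><tr><td>banana</td><td>republic</td></tr>'
         '<tr><td>stateless</td><td>person</td></table></html>')

no_sep = striphtml(table)               # words run together, as in the question
spaced = striphtml(table, ' ').split()  # tags become spaces, then split to words
```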