How to split texts into sentences and write them to xml - python

I am trying to structure my text document in an xml structure, where each sentence gets an id. I have text documents with unstructured sentences and I would like to split the sentences using a '.' delimiter and write them to xml. Here is my code:
import re
#Read the file
with open ('C:\\Users\\ngwak\\Documents\\test.txt') as f:
    content = [f]
split_content = []
for element in content:
    split_content += re.split("(.)\s+", element)
print(split_content, sep='\n\n')
But I am already getting this error and I can't interpret it:
TypeError: expected string or buffer
How can I split my sentences and write them to xml? Thanks a lot.
This is what my txt file looks like:
In a formal sense, the germ of national consciousness can be traced back to the Peace Treaty of Hoachanas signed in 13–June-1858 between soldiers, all the chiefs except those of the Bondelswarts (who had not been involved in the previous fighting), as well as by Muewuta, two sons of amuaha, formerly a Commandant of Chief Onag of the Triku people. There is ample epistolary as well as oral evidence for this view. The most poignant statement is to be found in the now famous and oft-quoted letter of Onag to Bonagha written on May 13, 1890 in which, amongst other things, he says that on June 13 there are people coming. Again on the 01.02.2015 till the 01.05 there are some coming.
And I would like the sentences to be like this in xml:
<sentence id=01>In a formal sense, the germ of national consciousness
can be traced back to the Peace Treaty of Hoachanas signed in 13–June-
1858 between soldiers, all the chiefs except those of the Bondelswarts
(who had not been involved in the previous fighting), as well as by
Muewuta, two sons of amuaha, formerly a Commandant of Chief Onag of the
Triku people. </sentence>

with open('C:\\Users\\ngwak\\Documents\\test.txt', "r") as text_file:
    # Join lines with a space so words at line breaks are not glued together
    textLinesFromFile = text_file.read().replace("\n", " ").split('.')
for sentenceNumber in range(0, len(textLinesFromFile)):
    print(textLinesFromFile[sentenceNumber].strip())
    #Or write each sentence in your XML

You don't need the content = [f] line.
import re
with open ('C:\\Users\\ngwak\\Documents\\test.txt') as file:
    split_content = []
    for element in file:
        # Escape the dot so it matches a literal period, not any character
        split_content += re.split(r"(\.)\s+", element)
print(split_content, sep='\n\n')
File objects are iterable. Using them in a for loop will iterate over each line.
Further Reading
Methods on File objects in the Python Docs
The example in this SO answer: Iterating on a file using Python
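To cover the XML part of the question as well, here is a minimal sketch using only the standard library. It assumes a sentence ends at a period followed by whitespace and a capital letter, which keeps dates such as 01.02.2015 intact but will still break on abbreviations. The function name sentences_to_xml and the document root tag are my own choices for illustration, not something from the question.

```python
import re
import xml.etree.ElementTree as ET

def sentences_to_xml(text):
    """Split text into sentences and wrap each one in a <sentence> element."""
    # Split on whitespace that follows a period and precedes a capital letter.
    parts = re.split(r"(?<=\.)\s+(?=[A-Z])", text.strip())
    root = ET.Element("document")  # hypothetical root tag
    for number, sentence in enumerate(parts, start=1):
        element = ET.SubElement(root, "sentence", id=f"{number:02d}")
        element.text = sentence
    return root

sample = ("There is ample evidence for this view. "
          "Again on the 01.02.2015 till the 01.05 there are some coming.")
root = sentences_to_xml(sample)
print(ET.tostring(root, encoding="unicode"))
```

To save the result, something like ET.ElementTree(root).write("sentences.xml", encoding="utf-8") should do; the file name is only a placeholder.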

Related

How to extract questions from a word doc with Python using regex

I am using the docx library to read files from a Word doc, and I am trying to extract only the questions using regex search and match. I have found countless ways of doing it, but I keep getting a "TypeError".
The data I am trying to extract is this:
Will my financial aid pay for housing?
Off Campus Housing - After financial aid applies toward your tuition and fees, any remaining funds will be sent to you as a refund that will either be directly deposited (which can be set up through your account) or mailed to you as a paper check. You can then use the refund to pay your rent. It is important to note that financial aid may not be available when rent is due, so make sure to have a plan in place to pay your rent. Will my financial aid pay for housing?
"financial" "help" "house"
funds "univ oak"
"money" "chisho"
"pay" "chap"
"grant" "laurel"
What are the requirements to receive a room and grant?
How do I pay for my housing?
How do I pay for housing?
If there's also an easier method of exporting the word doc into a different type of file, that'll be great to know for feedback. Thank you
I am using regex 101, I've tried the following regex expressions to match only the sentences that end in a question mark
".*[?=?]$"
"^(W|w).*[?=?]$"
"^[A-Za-z].*[?=?]$"
import re
import sys
from docx import Document
wordDoc = Document('botDoc.docx')
result = re.search('.*[?=?]$', wordDoc)
print(result)
if result:
    print(result.group(0))
for table in wordDoc.tables:
    for row in table.rows:
        for cell in row.cells:
            print("test")
I expect to save the matching patterns into directories so I can export the data to a csv file
Your error:
result = re.search('.*[?=?]$', wordDoc)
I believe that this line is the cause of the problem. search() is expecting a string as a second parameter, but is receiving a Document object.
What you should do is use the findall() function. search() only finds the first match for a pattern; findall() finds all the matches and returns them as a list of strings, with each string representing one match.
Since you are working with docx, you would have to extract the contents of the docx and use them as second parameter of the findall() method. If I remember correctly, this is done by first extracting all the paragraphs, and then extracting the text of the individual paragraphs. Refer to this question.
FYI, the way you would do this for a simple text file is the following:
import re
# Open the file and feed its text into findall(); it returns a list of all the found strings
with open('test.txt', 'r') as f:
    strings = re.findall(r'your pattern', f.read())
Your Regex:
Unfortunately, your regex is not quite correct. Although it makes sense logically to match only sentences that end in a ?, one of your matches is "place to pay your rent. Will my financial aid pay for housing?", for example. Only the second part of that match is an actual question. So require the match to start with an upper-case letter rather than a lower-case one. Your regex should be something like:
[A-Z].*\?$
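Note that .* is greedy, so on a line that holds several sentences the pattern above still matches from the first capital letter on the line through to the final question mark. Here is a rough sketch of a stricter variant that refuses to cross a sentence terminator. The paragraphs list is made-up sample data standing in for the paragraph texts you would pull out of the docx file, and the pattern is only one possible choice.

```python
import re

# Sample paragraph texts (assumed; in practice these would come from the document).
paragraphs = [
    "Will my financial aid pay for housing?",
    "It is important to have a plan in place to pay your rent. Will my financial aid pay for housing?",
    '"financial" "help" "house"',
    "How do I pay for my housing?",
]

# A capital letter, then anything except sentence terminators, then a question mark.
QUESTION = re.compile(r"[A-Z][^.?!]*\?")

questions = []
for text in paragraphs:
    for match in QUESTION.findall(text):
        if match not in questions:  # skip duplicates, keep first-seen order
            questions.append(match)

print(questions)
```

This will miss questions that contain an abbreviation with a period, so treat it as a starting point rather than a complete solution.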

create XML file parsing a text file // xml.etree.ElementTree not working

I am trying to structure data out of a text file into a XML file tagging parts of the text that I want to mark with XML taggers.
The Problem.
xml.etree.ElementTree does not recognise the string
The code so far.
import xml.etree.ElementTree as ET
with open('input/application_EN.txt', 'r') as f:
    application_text=f.read()
The first thing I want to do is to tag the paragraphs. The text should look like:
<description>
<paragraph id=1>
blabla
</paragraph>
<paragraph id=2>
blabla
</paragraph>
...
</description>
So far I have coded:
# splitting the text into paragraphs
list_of_paragraphs = application_text.splitlines()
# creating a new list where no_null paragraphs will be added
list_of_paragraphs_no_null=[]
# counter of paragraphs of the XML file
j=0
# Create the XML file with the paragraphs
for i,paragraph in enumerate(list_of_paragraphs):
    # Adding only the paragraphs different than ''
    if paragraph != '':
        j = j + 1
        # be careful with the space after and before the tag.
        # Adding the XML tags per paragraph
        xml_element = '<paragraph id=\"' + str(j) +'\">' + paragraph.strip() + ' </paragraph>'
# Now I pass the whole string to the XML constructor
root = ET.fromstring(description_text)
I get this error:
not well-formed (invalid token): line 1, column 6
After some investigation I realised that the error is given by the fact that the text contains the symbol "&".
Adding and taking out "&" in several places confirms that.
The question is: why is "&" not treated as text, and what can I do?
I know I could replace all "&", but then I would lose information, since "& Co." is quite an important string.
I would like the text to stay intact. (no changing content).
Suggestions?
thanks.
EDIT:
In order to make it easier, here is the beginning of the text I am working on (instead of opening a file you can paste this in to check it):
application_text='''Language=English
Has all kind of kind of references. also measures.
Photovoltaic solar cells for directly converting radiant energy from the sun into electrical energy are well known. The manufacture of photovoltaic solar cells involves provision of semiconductor substrates in the form of sheets or wafers having a shallow p-n junction adjacent one surface thereof (commonly called the "front surface"). Such substrates may include an insulating anti-reflection ("AR") coating on their front surfaces, and are sometimes referred to as "solar cell wafers". The anti-reflection coating is transparent to solar radiation. In the case of silicon solar cells, the AR coating is often made of silicon nitride or an oxide of silicon or titanium. Such solar cells are manufactured and sold by E.I. duPont de Nemeurs & Co.'''
As you see at the end there is a symbol "& Co." which causes trouble.
from:
& Symbol causing error in XML Code
Some characters have special meaning in XML and ampersand (&) is one of them. Consequently, these characters should be substituted (i.e. use string replacement) with their respective entity references. Per the XML specification, there are 5 predefined entities in XML:
&lt; < less than
&gt; > greater than
&amp; & ampersand
&apos; ' apostrophe
&quot; " quotation mark
Thanks @fallenreaper for pointing me towards BS to create XML files.
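For what it's worth, the substitution does not have to be done by hand, and the text content is not lost: a parser turns the entity back into "&" when reading. Below is a small sketch with the standard library showing two options, escaping the text before concatenating it into an XML string, or letting ElementTree build the element so that it escapes on serialization. The sample sentence is taken from the question; the variable names are mine.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

text = "Such solar cells are manufactured and sold by E.I. duPont de Nemeurs & Co."

# Option 1: escape &, < and > before building the XML string by hand.
xml_string = '<paragraph id="1">' + escape(text) + "</paragraph>"
parsed = ET.fromstring(xml_string)

# Option 2: build the tree with ElementTree; escaping happens on serialization.
description = ET.Element("description")
paragraph = ET.SubElement(description, "paragraph", id="1")
paragraph.text = text
serialized = ET.tostring(description, encoding="unicode")

print(parsed.text)   # the original text, with a plain "&"
print(serialized)    # the XML, with "&amp;" in place of "&"
```

Option 2 is probably the safer habit, since it also takes care of quoting attribute values.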

How to robustly extract author names from pdf papers?

I'd like to extract author names from pdf papers. Does anybody know a robust way to do so?
For example, I'd like to extract the name Archana Shukla from this pdf https://arxiv.org/pdf/1111.1648
PDF documents contain Metadata. It includes information about the document and its contents such as the author’s name, keywords, copyright information. See Adobe doc.
You can use PyPDF2 to extract PDF Metadata. See the documentation about the DocumentInformation class.
This information may not be filled in and can appear blank. So, one possibility is to parse the beginning or the end of the text and extract what you think is the author name. Of course, this is not reliable. But if you have a bibliographic database, you can try to match against it.
Nowadays, editors like Microsoft Word or Libre Office Writer always fill the author name in the Metadata. And it is copied in the PDF when you export your documents. So, this should work for you. Give it a try and tell us!
I am going to pre-suppose that you have a way to extract text from a PDF document, so the question is really "how can I figure out the author from this text". I think one straightforward solution is to use the correspondence email. Here is an example implementation:
import difflib
# Some sample text
pdf_text="""SENTIMENT ANALYSIS OF DOCUMENT BASED ON ANNOTATION\n
Archana Shukla\nDepartment of Computer Science and Engineering,
Motilal Nehru National Institute of Technology,
Allahabad\narchana@mnnit.ac.in\nABSTRACT\nI present a tool which
tells the quality of document or its usefulness based on annotations."""
def find_author(some_text):
    words = some_text.split(" ")
    emails = []
    for word in words:
        if "@" in word:
            emails.append(word)
    emails_clean = emails[0].split("\n")
    actual_email = [a for a in emails_clean if "@" in a]
    actual_email = actual_email[0]
    maybe_name = actual_email.split("@")[0]
    all_words_lists = [a.split("\n") for a in words]
    words = [a for sublist in all_words_lists for a in sublist]
    words.remove(actual_email)
    return difflib.get_close_matches(maybe_name, words)
In this case, find_author(pdf_text) returns ['Archana']. It's not perfect, but it's not incorrect. I think you could likely extend this in some clever ways, perhaps by getting the next word after the result or by combining this guess with metadata, or even by finding the DOI in the document if/when it exists and looking it up through some API, but nonetheless I think this should be a good starting point.
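One way to extend this, sketched below under the assumption that the author's name sits on its own line (as it does on many title pages), is to return the whole line that best matches the local part of the e-mail address instead of a single word. The function name guess_author_line, the e-mail pattern and the sample text are my own illustration, and the heuristic will fail when the address has nothing in common with the name.

```python
import difflib
import re

def guess_author_line(text):
    """Return the line most similar to the local part of the first e-mail address, or None."""
    match = re.search(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)
    if match is None:
        return None
    local_part = match.group(0).split("@")[0].lower()
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    candidates = [line for line in lines if "@" not in line]

    def score(line):
        # Best similarity between the local part and any single word on the line.
        return max(
            difflib.SequenceMatcher(None, local_part, word.lower()).ratio()
            for word in line.split()
        )

    return max(candidates, key=score, default=None)

sample = (
    "SENTIMENT ANALYSIS OF DOCUMENT BASED ON ANNOTATION\n"
    "Archana Shukla\n"
    "Department of Computer Science and Engineering\n"
    "archana@mnnit.ac.in\n"
    "ABSTRACT"
)
print(guess_author_line(sample))
```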
First things first: some PDFs out there have pages that are images, and I don't know if you can extract the text from an image easily. But for the PDF you linked, I think it can be done. There is a package called PyPDF2 which, as far as I know, can extract the text from a PDF. All that is left is to scan the last few pages and parse the author names.
An example of how to use the package is described here. Some of the code listed there is as follows:
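import PyPDF2
# Legacy PyPDF2 (< 3.0) API; newer releases use PdfReader, reader.pages and extract_text()
pdfFileObj = open('meetingminutes.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
print(pdfReader.numPages)
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())
pdfFileObj.close()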

How to read list element in Python from a text file?

My text file is like below.
[0, "we break dance not hearts by Short Stack is my ringtone.... i LOVE that !!!.....\n"]
[1, "I want to write a . I think I will.\n"]
[2, "#va_stress broke my twitter..\n"]
[3, "\" "Y must people insist on talking about stupid politics on the comments of a bubblegum pop . Sorry\n"]
[4, "aww great "Picture to burn"\n"]
[5, "#jessdelight I just played ur joint two s ago. Everyone in studio was feeling it!\n"]
[6, "http://img207.imageshack.us/my.php?image=wpcl10670s.jpg her s are so perfect.\n"]
[7, "cannot hear the new due to geographic location. i am geographically undesirable. and tune-less\n"]
[8, "\" couples in public\n"]
[9, "damn wendy's commerical got that damn in my head.\n"]
[10, "i swear to cheese & crackers #zyuuup is in Detroit like every 2 months & i NEVER get to see him! i swear this blows monkeyballs!\n"]
[11, "\" getting ready for school. after i print out this\n"]
I want to read the second element of each list, i.e. all the tweet texts, into an array.
I wrote
tweets = []
for line in open('tweets.txt').readlines():
    print line[1]
    tweets.append(line)
but when I look at the output, it just takes the 2nd character of every line.
When you read a text file in Python, the lines are just strings. They aren't automatically converted to some other data structure.
In your case, it looks like each line in your file contains a JSON list. In that case, you can parse the line first using json.loads(). This converts the string to a Python list which you can then take the second element of:
import json
with open('tweets.txt') as fp:
    tweets = [json.loads(line)[1] for line in fp]
Maybe you should consider using the json.loads method:
import json
tweets = []
for line in open('tweets.txt').readlines():
    # Parse the line, then keep the tweet text rather than the raw line
    tweet = json.loads(line)[1]
    print(tweet)
    tweets.append(tweet)
There is a more Pythonic way in @Erik Cederstrand's comment.
Rather than guessing what format the data is in, you should find out.
If you're generating it yourself, and don't know how to parse back in what you're creating, change your code to generate something that can be easily parsed with the same library used to generate it, like JsonLines or CSV.
If you're ingesting it from some API, read the documentation for that API and parse it the way it's documented.
If someone handed you the file and told you to parse it, ask that someone what format it's in.
Occasionally, you do have to deal with some crufty old file in some format that was never documented and nobody remembers what it was. In that case, you do have to reverse engineer it. But what you want to do then is guess at likely possibilities, and try to parse it with as much validation and error handling as possible, to verify that you guessed right.
In this case, the format looks a lot like either JSON lines or ndjson. Both are slightly different ways of encoding multiple objects with one JSON text per line, with specific restrictions on those texts and the way they're encoded and the whitespace between them.
So, while a quick&dirty parser like this will probably work:
import json
with open('tweets.txt') as f:
    for line in f:
        tweet = json.loads(line)
        dosomething(tweet)
You probably want to use a library like jsonlines:
import jsonlines
with jsonlines.open('tweets.txt') as f:
    for tweet in f:
        dosomething(tweet)
The fact that the quick&dirty parser works on JSON lines is, of course, part of the point of that format—but if you don't actually know whether you have JSON lines or not, you're better off making sure.
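If you would rather stay with the standard library, "making sure" can be sketched roughly as follows: parse each line, check that it has the shape you expect, and collect the lines that fail instead of crashing on the first one. The function name parse_tweet_lines and the expected shape (a two-element list whose second item is a string) are my assumptions based on the sample in the question.

```python
import json

def parse_tweet_lines(lines):
    """Parse one JSON array per line; return (tweets, errors) rather than raising."""
    tweets, errors = [], []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((number, str(exc)))
            continue
        # Validate the assumed shape: [index, "tweet text"]
        if isinstance(record, list) and len(record) == 2 and isinstance(record[1], str):
            tweets.append(record[1])
        else:
            errors.append((number, "unexpected record shape"))
    return tweets, errors

sample_lines = [
    '[0, "we break dance not hearts\\n"]\n',
    '[4, "aww great "Picture to burn"\\n"]\n',  # unescaped quotes: not valid JSON
    '[9, "damn wendy\'s commerical\\n"]\n',
]
tweets, errors = parse_tweet_lines(sample_lines)
print(tweets)
print(errors)
```

With a real file you would pass the open file object instead of sample_lines. A non-empty errors list is a strong hint that the file is not actually JSON lines.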
Since your input looks like Python expressions, I'd use ast.literal_eval to parse them.
Here is an example:
import ast
with open('tweets.txt') as fp:
    tweets = [ast.literal_eval(line)[1] for line in fp]
print(tweets)
Output:
['we break dance not hearts by Short Stack is my ringtone.... i LOVE that !!!.....\n', 'I want to write a . I think I will.\n', '#va_stress broke my twitter..\n', '" "Y must people insist on talking about stupid politics on the comments of a bubblegum pop . Sorry\n', 'aww great "Picture to burn"\n', '#jessdelight I just played ur joint two s ago. Everyone in studio was feeling it!\n', 'http://img207.imageshack.us/my.php?image=wpcl10670s.jpg her s are so perfect.\n', 'cannot hear the new due to geographic location. i am geographically undesirable. and tune-less\n', '" couples in public\n', "damn wendy's commerical got that damn in my head.\n", 'i swear to cheese & crackers #zyuuup is in Detroit like every 2 months & i NEVER get to see him! i swear this blows monkeyballs!\n', '" getting ready for school. after i print out this\n']

Output/save a certain string within a text file

Let's say I open a text file with Python3:
fname1 = "filename.txt"
with open(fname1, "rt", encoding='latin1') as in_file:
    readable_file = in_file.read()
The output is a standard text file of paragraphs:
\n\n"Well done, Mrs. Martin!" thought Emma. "You know what you are about."\n\n"And when she had come away, Mrs. Martin was so very kind as to send\nMrs. Goddard a beautiful goose--the finest goose Mrs. Goddard had\never seen. Mrs. Goddard had dressed it on a Sunday, and asked all\nthe three teachers, Miss Nash, and Miss Prince, and Miss Richardson,\nto sup with her."\n\n"Mr. Martin, I suppose, is not a man of information beyond the line\nof his own business? He does not read?"\n\n"Oh yes!--that is, no--I do not know--but I believe he has\nread a good deal--but not what you would think any thing of.\nHe reads the Agricultural Reports, and some other books that lay\nin one of the window seats--but he reads all _them_ to himself.\nBut sometimes of an evening, before we went to cards, he would read\nsomething aloud out of the Elegant Extracts, very entertaining.\nAnd I know he has read the Vicar of Wakefield. He never read the\nRomance of the Forest, nor The Children of the Abbey. He had never\nheard of such books before I mentioned them, but he is determined\nto get them now as soon as ever he can."\n\nThe next question was--\n\n"What sort of looking man is Mr. Martin?"
How can one save only a certain string within this file? For example, how does one save the sentence
And when she had come away, Mrs. Martin was so very kind as to send\nMrs. Goddard a beautiful goose--the finest goose Mrs. Goddard had\never seen.
into a separate text file? How do you know the indices where to access this sentence?
CLARIFICATION: There should be no decision statements to make. The end goal is to create a program with which the user could "save" sentences or paragraphs separately. I am asking a more general question at the moment.
Let's say there's a certain paragraph I like in this text. I would like a way to append this quickly to a JSON file or text file. In principle, how does one do this? Tagging all sentences? Is there a way to isolate paragraphs? (To repeat above) How do you know the indices where to access this sentence? (especially if there is no "decision logic")
If I know the indices, couldn't I simply slice the string?
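As a rough illustration of the slicing idea, and not a full answer: assuming paragraphs are separated by blank lines and that the passage to keep can be identified by its opening and closing words, str.find gives the indices and a slice gives the text. The shortened sample string and the JSON layout below are made up for the example.

```python
import json

text = (
    '"And when she had come away, Mrs. Martin was so very kind as to send\n'
    'Mrs. Goddard a beautiful goose--the finest goose Mrs. Goddard had\n'
    'ever seen. Mrs. Goddard had dressed it on a Sunday."\n\n'
    '"Mr. Martin, I suppose, is not a man of information?"'
)

# Paragraphs: split on blank lines.
paragraphs = [p for p in text.split("\n\n") if p.strip()]

# Indices of a passage, located by its first and last words.
start_marker = "And when she had come away"
end_marker = "ever seen."
start = text.find(start_marker)
end = text.find(end_marker, start) + len(end_marker)
selection = text[start:end]

# Store the indices together with the text, e.g. as one JSON record.
saved = json.dumps({"start": start, "end": end, "text": selection})
print(saved)
```

str.find returns -1 when a marker is missing, so a real program would need to check for that before slicing.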
