Correctly parse PDF paragraphs with Python - python

I am creating a Python script that is supposed to load a bunch of PDF files from the system, do some data analysis and output the results. The nature of the data analysis is such that I must parse the PDF by paragraph, and for every paragraph I must iterate over every phrase and check whether some conditions are met.
I am currently parsing with Tika, and this is how I evaluate paragraphs.
I load the content, then replace every run of two or more newlines with a unique placeholder string, replace every remaining single newline with a space, and finally replace the placeholder with a double newline. I did this so that it is clearer which newlines delimit a paragraph. Then I extract the paragraphs and return the list with no duplicates (Tika sometimes duplicates content).
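import re

def getpdfcontent(path):
    pdf_content = extract_pdf(path)
    # Mark paragraph breaks (two or more newlines) with a placeholder
    text = re.sub(r"\n{2,}", "<131313>", pdf_content['content'])
    # Remaining single newlines are just line wraps inside a paragraph
    text = text.replace("\n", " ")
    text = text.replace("<131313>", "\n\n")
    paragraphs = extractparagraphs(text.splitlines())
    return removeduplicates(paragraphs)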
This is how I extract paragraphs. I check whether the current line is empty, and if the current paragraph has something in it, I append it to a list.
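def extractparagraphs(lines):
    current = ""
    paragraphs = []
    for line in lines:
        if not line.strip():
            # A blank line closes the paragraph collected so far
            if current.strip():
                paragraphs.append(current)
                current = ""
            continue
        current += line.strip()
    # Keep the last paragraph when the text does not end with a blank line
    if current.strip():
        paragraphs.append(current)
    return paragraphs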
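One possible simplification, sketched under the same assumption the code above already makes (blank lines mark paragraph boundaries): splitting on runs of blank lines and collapsing the whitespace inside each chunk does the work in a single pass, without the placeholder string. The name split_paragraphs is made up for this illustration, and the order-preserving de-duplication only stands in for removeduplicates, whose implementation is not shown in the question.

```python
import re

def split_paragraphs(text):
    """Split raw text into paragraphs separated by one or more blank lines."""
    paragraphs = []
    seen = set()
    # A paragraph boundary is a newline, optional whitespace, then another newline.
    for chunk in re.split(r"\n\s*\n", text):
        # Collapse line wraps and repeated spaces inside the paragraph.
        paragraph = " ".join(chunk.split())
        if paragraph and paragraph not in seen:
            seen.add(paragraph)
            paragraphs.append(paragraph)
    return paragraphs

sample = "First line\nsame paragraph.\n\n\nSecond paragraph.\n\nFirst line same paragraph.\n"
print(split_paragraphs(sample))
# ['First line same paragraph.', 'Second paragraph.']
```

Whether exact-match de-duplication is appropriate depends on how Tika repeats content in practice, so treat that part as a placeholder.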
This is how I get phrases. I might add ! and ? to the split too.
def getphrases(document):
    phrases = []
    phr = document.split(".")
    phrases.extend(phr)
    return phrases
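Splitting on !, ? and . together could look like the sketch below. It uses re.split from the standard library; the function name get_phrases is hypothetical. Note that this naive rule also splits inside abbreviations and decimal numbers such as 3.14, so it is only a starting point.

```python
import re

def get_phrases(paragraph):
    """Split a paragraph into phrases at '.', '!' or '?' and drop empty pieces."""
    # Several terminators in a row ("?!", "...") count as a single boundary.
    pieces = re.split(r"[.!?]+", paragraph)
    return [piece.strip() for piece in pieces if piece.strip()]

print(get_phrases("Is it done? Yes! It works... mostly."))
# ['Is it done', 'Yes', 'It works', 'mostly']
```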
My priority now is to know whether I can improve the parsing.
If not, are there any optimisations I can make?

Related

How to read from text file into array paragraph by paragraph?

I'm making a text-based game and want to read from the story text file paragraph by paragraph, rather than printing a certain number of characters.
You wake up from a dazed slumber to find yourself in a deep dank cave with moonlight casting upon the entrance...
You see a figure approaching towards you... Drawing nearer you hear him speak...
You want this: my_list = my_string.splitlines()
https://docs.python.org/3/library/stdtypes.html#str.splitlines
As @martineau suggested, you need a delimiter to separate the paragraphs.
This can even be a newline character (\n). Once you have it, read the whole content of the file and split it on that delimiter.
Doing so produces a list in which each element is a paragraph.
Some example code:
delimiter = "\n"
with open("paragraphs.txt", "r") as paragraphs_file:
    all_content = paragraphs_file.read()  # reading all the content in one step
    # using the string methods we split it
    paragraphs = all_content.split(delimiter)
This approach has some drawbacks: it reads all the content at once, so if the file is big you fill memory with text that you don't need yet at that point in the story.
Looking at your text example, and knowing that you will print the retrieved text continuously, reading one line at a time could be a better solution:
with open("paragraphs.txt", "r") as paragraphs_file:
    for paragraph in paragraphs_file:  # one line until the end of file
        if paragraph != "\n":
            print(paragraph)
Obviously add some logic control where you need it.
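If a paragraph can span several lines, the line-by-line idea can be extended with a small generator that groups lines until it meets a blank one. This is only a sketch: iter_paragraphs is a made-up name, and the short story list stands in for the lines of the real file.

```python
def iter_paragraphs(lines):
    """Yield paragraphs from an iterable of lines; blank lines separate paragraphs."""
    current = []
    for line in lines:
        if line.strip():
            current.append(line.strip())
        elif current:
            yield " ".join(current)
            current = []
    if current:  # the last paragraph may not be followed by a blank line
        yield " ".join(current)

# Sample lines standing in for the story file.
story = [
    "You wake up from a dazed slumber\n",
    "in a deep dank cave.\n",
    "\n",
    "You see a figure approaching.\n",
]
for paragraph in iter_paragraphs(story):
    print(paragraph)
```

Because iterating over an open file yields its lines one at a time, a file object can be passed in place of the list, which keeps only the current paragraph in memory.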

Text Cleaning Issues

I'm learning text cleaning with Python online.
I have removed some stop words and lower-cased the text,
but when I execute this code, it doesn't show anything.
I don't know why.
# we add some words to the stop word list
texts, article = [], []
for w in doc:
    # if it's not a stop word or punctuation mark, add it to our article!
    if w.text != '\n' and not w.is_stop and not w.is_punct and not w.like_num and w.text != 'I':
        # we add the lemmatized version of the word
        article.append(w.lemma_)
    # if it's a new line, it means we're onto our next document
    if w.text == '\n':
        texts.append(article)
        article = []
When I try to output texts, it's just blank.
I believe the 'texts' list and the 'article' list refer to the same content, and hence clearing one list's content also clears the other.
Here is a link to a similar question: Python: Append a list to another list and Clear the first list
Please see if the above are useful.
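A small experiment can help check whether aliasing is really the cause. The sketch below uses plain lists with made-up values and compares rebinding a name (article = []) with clearing the list in place (article.clear()).

```python
# Rebinding: `article = []` creates a new list; the one stored in texts is untouched.
texts, article = [], []
article.append("cave")
texts.append(article)
article = []
rebound_result = list(texts)

# In-place clearing: texts holds the *same* list object, so it is emptied too.
texts, article = [], []
article.append("cave")
texts.append(article)
article.clear()
cleared_result = list(texts)

print(rebound_result)  # [['cave']]
print(cleared_result)  # [[]]
```

Only the in-place version empties the stored list. Since texts only grows when a token whose text is exactly '\n' is seen, it may also be worth checking whether doc contains any such tokens.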

Need help finding the correct regex pattern for my string pattern

I'm terrible with RegEx patterns, and I'm writing a simple python program that requires splitting lines of a file into a 'content' part and a 'tags' part, and then further splitting the tags parts into individual tags. Here's a simple example of what one line of my file might look like:
The Beatles <music,rock,60s,70s>
I've opened my file and begun reading lines like this:
def Load(self, filename):
    file = open(filename, 'r')
    for line in file:
        # Ignore comments and empty lines..
        if not line.startswith('#') and line.strip():
            #...
Forgive my likely terrible Python, it's my first few days with the language. Anyway, next I was thinking it would be useful to use a regex to break my string into sections - with a variable to store the 'content' (for example, "The Beatles"), and a list/set to store each of the tags. As such, I need a regex (or two?) that can:
Split the raw part from the <> part.
And split the tags part into a list based on the commas.
Finally, I want to make sure that the content part retains its capitalization and inner spacing. But I want to make sure the tags are all lower-case and without white space.
I'm wondering if any of the regex experts out there can help me find the correct pattern(s) to achieve my goals here?
This is a solution that gets around the problem without using a regex, by relying on multiple splits.
# This separates the string into the content and the remainder
content, tagStr = line.split('<')
# This splits the tagStr into individual tags. [:-1] is used to remove trailing '>'
tags = tagStr[:-1].split(',')
print(content)
print(tags)
The problem with this is that it leaves a trailing whitespace after the content.
You can remove this with:
content = content[:-1]
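Putting the pieces together, a regex-free version that also lower-cases the tags and strips their whitespace might look like this. It is a sketch under the assumption that each line has the form shown in the question; parse_line is a hypothetical helper name.

```python
def parse_line(line):
    """Split 'Content <tag1, tag2>' into the content string and a list of clean tags."""
    content, _, tag_part = line.partition("<")
    # Drop the closing '>' and any trailing newline from the tag section.
    tag_part = tag_part.strip().rstrip(">")
    # Lower-case each tag and remove all whitespace inside it.
    tags = ["".join(tag.split()).lower() for tag in tag_part.split(",")]
    tags = [tag for tag in tags if tag]
    # strip() trims only the ends, so capitalization and inner spacing survive.
    return content.strip(), tags

print(parse_line("The Beatles <music, Rock,60s,70s>\n"))
# ('The Beatles', ['music', 'rock', '60s', '70s'])
```

Using partition instead of split means a line without any < still returns its content, with an empty tag list.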

Python - RegEx match does not write to file if it contains full-stop

I'm trying to use a RegEx expression in a Python script in order to find specific variables within a webpage. I then export this using a csv file. However, if the found group contains a full-stop, it does not export at all. How do I remedy this?
In this webpage, the item displayed changes depending on a code inputted. My script automates the inputting of codes, and then records the item produced. Here are the relevant parts of my code:
import re
regName = r'The item name is (.*?)\.'
response = opener.open(
    'http://website.com/webpage.php' + itemValues)
html = response.read()
responseDecode = html.decode('utf8')
name = re.findall(regName, responseDecode)
#Convert stuff to Unicode
uniName = name[0].encode('utf8', 'replace')
with open("readable.txt", "a") as file:
    file.write("\n"*2)
    file.write(uniName + '\n')
Of note, I convert to unicode because some of the item names contain accented characters.
EDIT: an example of something that would not work is R.O.B.O.T. All that would be written is R.
Try using regName = r'The item name is (.*?)\.$'. The $ marks the end of the string, so the other full stops are no longer taken as the end of the name. Right now the lazy (.*?) stops at the first full stop it finds, so the match ends there.
Or if the string doesn't end right there, try adding a space or some other following character. You need to specify the kind of character that marks the end of the item string.
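The difference between the patterns can be seen on a small example. The sentences below are made-up test strings, and the third pattern is one possible way to express "a full stop followed by whitespace or the end of the string" using a lookahead.

```python
import re

text = "The item name is R.O.B.O.T."

# Lazy group followed by a bare "\." stops at the first full stop.
first_stop = re.findall(r"The item name is (.*?)\.", text)

# Anchoring with "$" forces the match to run to the final full stop.
anchored = re.findall(r"The item name is (.*?)\.$", text)

# If more text follows, require whitespace or end-of-string after the full stop.
longer = "The item name is R.O.B.O.T. It is in stock."
followed = re.findall(r"The item name is (.*?)\.(?=\s|$)", longer)

print(first_stop)  # ['R']
print(anchored)    # ['R.O.B.O.T']
print(followed)    # ['R.O.B.O.T']
```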

Need to match whole word completely from long set with no partials in Python

I have a function that - as a larger part of a different program - checks to see if a word entry is in a text file. So if the text file looks like this:
aardvark
aardvark's
aardvarks
abaci
.
.
.
zygotes
I just ran a quick if statement:
infile = open("words","r") # Words is the file with all the words. . . yeah.
text = infile.read()
if word in text:
    return 1
else:
    return 0
Works, sort-of. The problem is, while it returns true for aardvark, and false for wj;ek, it also will return true for any SUBSET of any word. So, for example, the word rdva will come back as a 'word' because it IS in the file, as a subset of aardvark. I need it to match whole words only, and I've been quite stumped.
So how can I have it match an entire word (which is equivalent to an entire line, here) or nothing?
I apologize if this question is answered elsewhere, I searched before I posted!
Many Thanks!
Iterate over each line and see if the whole line matches:
def in_dictionary(word):
    for line in open('words', 'r').readlines():
        if word == line.strip():
            return True
    return False
When you use the in operator, you are basically asking whether the word appears anywhere in the string.
Using == matches the whole line.
.strip() removes leading and trailing whitespace (including the newline at the end of each line); without it, hello would not equal {space}hello.
There is a simpler approach. Your file is, conceptually, a list of words, so build that list of words (instead of a single string).
with open("words") as infile:
    words = infile.read().split()
return word in words
<string> in <string> does a substring search, but <anything> in <list> checks for membership. If you are going to check multiple times against the same list of words, then you may improve performance by instead storing a set of the words (just pass the list to the set constructor).
Blender's answer works, but here is a different way that doesn't require you to iterate yourself:
Each line is going to end with a newline character (\n). So, what you can do is put a \n before and after your checked string when comparing. So something like this:
infile = open("words","r") # Words is the file with all the words. . . yeah.
text = "\n" + infile.read() # add a newline before the file contents so we can check the first line
if "\n"+word+"\n" in text:
    return 1
else:
    return 0
Watch out, though -- your line endings may be \r\n or just \r too.
It could also have problems if the word you're checking CONTAINS a newline. Blender's answer is better.
That's all great until you want to verify every word in a longer text using that list. For me and /usr/share/dict/words it takes up to 3ms to check a single word in words. Therefore I suggest using a dictionary (no pun) instead. Lookups were about 2.5 thousand times faster with:
words = {}
for word in open('words', 'r').readlines():
    words[word.strip()] = True
def find(word):
    return word in words
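For comparison, the set mentioned in an earlier answer gives the same hash-based lookups as the dictionary, without the dummy True values. A minimal sketch, with load_word_set as a hypothetical helper and a short in-memory list standing in for the words file:

```python
def load_word_set(lines):
    """Build a set of words from an iterable of lines, one word per line."""
    return {line.strip() for line in lines if line.strip()}

# Sample data standing in for the contents of the "words" file.
sample_lines = ["aardvark\n", "aardvark's\n", "aardvarks\n", "abaci\n", "zygotes\n"]
word_set = load_word_set(sample_lines)

print("aardvark" in word_set)  # True: exact match
print("rdva" in word_set)      # False: substrings no longer match
```

With a real file, the open file object can be passed directly, for example inside a with open("words") as infile: block, since iterating over it yields one line at a time.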
