PDF - Split Single Words into Individual Lines - Python 3

I am trying to extract the words in a PDF onto individual lines, but I can only do this with text files, as demonstrated below.
Moreover, the constraint is that I cannot convert the PDF to TXT first and then perform this operation; it must be done on the PDF file directly.
with open('filename.txt', 'r') as f:
    for line in f:
        for word in line.split():
            print(word)
If filename.txt contains just "Hello World!", then this prints:
Hello
World!
I need to do the same with searchable PDF files as well. Any help would be appreciated.

Check out PyMuPDF. There's loads of stuff you can do, including getting line-by-line text from a PDF using page.getText().
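As a rough, untested sketch of that approach (the file name is a placeholder, and newer PyMuPDF releases spell the call page.get_text() rather than getText()):

import fitz  # PyMuPDF

doc = fitz.open("sample.pdf")        # placeholder file name
for page in doc:
    text = page.get_text()           # whole page as plain text
    for line in text.splitlines():   # line by line
        for word in line.split():    # word by word, as in the TXT example
            print(word)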

For the PDF, you should use pdfminer or PyPDF2.
Here is a good article you can use to extract the text, and then you can use Anilkumar's method to split it line by line.
https://medium.com/#rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f
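For illustration only (not taken from the article), a minimal sketch using PyPDF2's newer API; older releases use PdfFileReader / extractText instead, and the file name is a placeholder:

from PyPDF2 import PdfReader

reader = PdfReader("sample.pdf")     # placeholder file name
for page in reader.pages:
    text = page.extract_text() or ""
    for word in text.split():
        print(word)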

You can use pdfreader to extract texts (both plain text and text containing PDF operators) from a PDF document.
Here is sample code extracting both of the above from all document pages.
from pdfreader import SimplePDFViewer, PageDoesNotExist

fd = open(you_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)

plain_text = ""
pdf_markdown = ""
try:
    while True:
        viewer.render()
        pdf_markdown += viewer.canvas.text_content
        plain_text += "".join(viewer.canvas.strings)
        viewer.next()
except PageDoesNotExist:
    pass
Just want to point out that text in PDFs usually does not come as "words": it comes as commands telling a conforming PDF viewer where and how to place each glyph, which means a single word may be drawn by several commands. Read more on that in the PDF 1.7 specification, sec. 9 - Text.
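With that caveat in mind, if all you need afterwards is the word-per-line output from the question, the collected plain_text can be split the same way as in the text-file example:

# Assumes plain_text was built by the pdfreader snippet above.
for word in plain_text.split():
    print(word)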

When I saw filename.txt I got confused. Since you are working with PDFs, the link below might be helpful; see if it helps:
How to use PDFminer.six with python 3?
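For reference, pdfminer.six also exposes a one-call helper; a minimal, untested sketch (the file name is a placeholder):

from pdfminer.high_level import extract_text

text = extract_text("sample.pdf")    # placeholder file name
for line in text.splitlines():
    for word in line.split():
        print(word)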

Related

Arabic output in python 3 is stored weirdly in a file

I made a small web-scraping bot using Python 3. Currently it takes the input between classes and puts it into a .csv file, but when I open the file I find the Arabic parts looking like this:
وائل ÙتÙ
I tried arabic_reshaper, but it looks like it only handles text direction and some sort of encoding; when storing the string it produces the same bad characters as above.
By contrast, this code below writes Arabic content to a text file successfully:
s = "ذهب الطالب الى المدرسة"
with open("file.txt", "w", encoding="utf-8") as myfile:
    myfile.write(s)
Note: I'm using the Selenium driver to get the content:
content = driver.page_source
soup = BeautifulSoup(content)
Try this, it should work:
soup = BeautifulSoup(content.decode('utf-8'))
Answer after digging into the problem some more:
1. I found that if I open the output with the normal Windows Notepad, I can see the Arabic content, so Python was producing the website content correctly!
2. I used this video as a reference to correctly show the data in Excel (which is where the problem was):
https://www.youtube.com/watch?v=V6AR_Hi7p5Q
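The usual fix when UTF-8 Arabic text looks garbled in Excel is to write the CSV with a byte-order mark, i.e. encoding="utf-8-sig". A minimal sketch of that idea (not taken from the video; the rows are made up):

import csv

rows = [["وائل", "some value"]]  # made-up example data

# Excel only detects UTF-8 reliably when a BOM is present, hence "utf-8-sig".
with open("output.csv", "w", newline="", encoding="utf-8-sig") as f:
    csv.writer(f).writerows(rows)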

Write OCR retrieved text from each image to separate text file corresponding to each image

I am reading a PDF file, converting each page to an image, and saving the images. Next I need to run OCR on each image, recognize that image's text, and write it to a new text file.
I know how to get all text from all images and dump it into one text file.
pdf_dir = 'dir path'
os.chdir(pdf_dir)

for pdf_file in os.listdir(pdf_dir):
    if pdf_file.endswith(".pdf"):
        pages = convert_from_path(pdf_file, 300)
        pdf_file = pdf_file[:-4]
        for page in pages:
            page.save("%s-page%d.jpg" % (pdf_file, pages.index(page)), "JPEG")

img_dir = 'dir path'
os.chdir(img_dir)
docs = []

for img_file in os.listdir(img_dir):
    if img_file.endswith(".jpg"):
        texts = str(((pytesseract.image_to_string(Image.open(img_file)))))
        text = texts.replace('-\n', '')
        print(texts)
        img_file = img_file[:-4]
        for text in texts:
            file = img_file + ".txt"
            # create the new file with "w+" as open it
            with open(file, "w+") as f:
                for texts in docs:
                    # write each element in my_list to file
                    f.write("%s" % str(texts))
            print(file)
I need one text file to be written corresponding to each image which has recognized the text within that image. The files which are presently written are all empty and I am not sure what is going wrong. Can someone help?
There's kind of a lot to unpack here:
1. You're iterating over docs, which is an empty list, to create the text files; as a result, each text file is merely created (empty) and f.write is never executed.
2. You're assigning text = texts.replace('-\n', '') but then not doing anything with it, instead iterating with for text in texts, so within that loop text is not the result of the replace but an item from the iterable texts.
3. Since texts is a str, each text in texts is a single character.
4. You're then using texts (also previously assigned) as the loop variable over docs (which, again, is empty).
2 and 4 aren't necessarily problematic, but probably are not good practice. 1 seems to be the main culprit for why you're producing empty text files. 3 seems to also be a logical error as you almost certainly do not want to write out individual characters to the file(s).
So I think this is what you want, but it is untested:
for img_file in os.listdir(img_dir):
    if img_file.endswith(".jpg"):
        texts = str(((pytesseract.image_to_string(Image.open(img_file)))))
        print(texts)
        file = img_file[:-4] + ".txt"
        # create the new file with "w+" as open it
        with open(file, "w+") as f:
            f.write(texts)
        print(file)

File size increasing after extraction?

This is a pretty general question, and I don't even know whether this is the correct community for it; if not, just tell me.
I recently had an HTML file from which I was extracting ~90 lines of HTML code (out of ~8000 total lines). I did this with a simple Python script and stored my output (the shortened HTML code) in a text file. Now I am curious, because the file size has increased. What could cause the file to get bigger after I extracted some part out of it?
File size before: 319.374 Bytes
File size after: 321.516 Bytes
Is this because of the different file formats html and txt?
Any help or suggestions appreciated!
Code:
import glob
import os
import re

def extractor():
    os.chdir(r"F:\Test")  # the directory containing my html
    for file in glob.iglob("*.html"):  # iterates over all files in the directory ending in .html
        with open(file, encoding="utf8") as f, open((file.rsplit(".", 1)[0]) + ".txt", "w", encoding="utf8") as out:
            contents = f.read()
            extract = re.compile(r'StartTag.*?EndTag', re.S)
            cut = extract.sub('', contents)
            if re.search(extract, contents) is not None:
                out.write(cut)
                out.close()

extractor()
EDIT: I also tried using ".html" instead of ".txt" as the file format for my output file. However, the difference remains.
This code does not write to the original HTML file. Something else must be causing the increased file size.
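One quick way to confirm which file actually grew is to compare sizes directly with os.path.getsize (a hypothetical check with made-up paths, not part of the original answer):

import os

html_file = r"F:\Test\page.html"  # hypothetical input path
txt_file = r"F:\Test\page.txt"    # hypothetical output path

print(os.path.getsize(html_file), "bytes (input)")
print(os.path.getsize(txt_file), "bytes (output)")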

How can I extract a text from a bytes file using python

I am trying to code a script that gets the code of a website, saves all the HTML to a file, and then extracts some information from it.
For the moment I've done the first part: I've saved all the HTML into a text file.
Now I have to extract the relevant information and save it in another text file.
But I'm having problems with encoding, and I also don't know very well how to extract the text in Python.
Parsing a website:
import urllib.request

# file name to store the data
file_name = r'D:\scripts\datos.txt'

# I want to get the text that goes after this tag <p class="item-description">
# and before this other one </p>
tag_starts_with = '<p class="item-description">'
tag_ends_with = '</p>'

# I get the website code and I save it into a text file
with urllib.request.urlopen("http://www.website.com/") as response, open(file_name, 'wb') as out_file:
    data = response.read()
    out_file.write(data)
    print(out_file)  # First question: how can I print the file? It gives me an error, I can't print bytes

# the file is now full of html text, so I want to open it and process it
file_for_results = open(r'D:\scripts\datos.txt', encoding="utf8")

# Extract information from the file
# Second question: how do I take a substring of the lines in the file and get the text between
# <p class="item-description"> and </p>, so I can store it in file_for_results?
Here is the pseudocode that I haven't been able to turn into code:
for line in file_to_filter:
    if line contains word_starts_with
        copy in file_for_results until you find </p>
I am assuming this is an assignment of some sort where you need to parse the HTML following a given algorithm; if not, just use Beautiful Soup.
The pseudocode actually translates to python code quite easily:
file_to_filter = open("file.html", 'r')
out_file = open("text_output", 'w')

for line in file_to_filter:
    if word_starts_with in line:
        print(line, end='', file=out_file)  # Store data in another file
    if word_ends_with in line:
        break
And of course you need to close the files, make sure you remove the tags and so on, but this is roughly what your code should be given this algorithm.
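If hand-parsing is not actually required, a rough Beautiful Soup sketch for the same tag could look like this (the class name comes from the question; the output path is hypothetical, and this is untested):

import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen("http://www.website.com/") as response:
    html = response.read()

soup = BeautifulSoup(html, "html.parser")
with open(r"D:\scripts\results.txt", "w", encoding="utf8") as out:  # hypothetical output file
    for p in soup.find_all("p", class_="item-description"):
        out.write(p.get_text() + "\n")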

Delete a specific string (not line) from a text file python

I have a text file with two lines:
<BLAHBLAH>483920349<FOOFOO>
<BLAHBLAH>4493<FOOFOO>
That's the only thing in the text file. Using Python, I want to rewrite the text file so that BLAHBLAH and FOOFOO are removed from each line. It seems like a simple task, but after brushing up on my file manipulation I can't seem to find a way to do it.
Help is greatly appreciated :)
Thanks!
If it's a text file as you say, and not HTML/XML/something else, just use replace:
for line in infile.readlines():
    cleaned_line = line.replace("BLAHBLAH", "")
    cleaned_line = cleaned_line.replace("FOOFOO", "")
and write cleaned_line to an output file.
f = open(path_to_file, "r+")  # "r+" reads and writes without truncating the file first
content = f.read().replace("<BLAHBLAH>", "").replace("<FOOFOO>", "")
f.seek(0)
f.write(content)
f.truncate()
f.close()
Update (saving to another file):
f = open(path_to_input_file, "r")
output = open(path_to_output_file, "w")
output.write(f.read().replace("<BLAHBLAH>","").replace("<FOOFOO>",""))
f.close()
output.close()
Consider the regular expressions module re.
result_text = re.sub('<(.|\n)*?>',replacement_text,source_text)
The strings within < and > are identified. The pattern is non-greedy, i.e. it matches a substring of the least possible length. For example, if you have "<1> text <2> more text", a greedy parser would take in "<1> text <2>", but a non-greedy parser takes in "<1>" and "<2>".
And of course, your replacement_text would be '' and source_text would be each line from the file.
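Applied to the two lines from the question, that substitution would go roughly like this (a small, untested sketch):

import re

# The two example lines from the question.
lines = ["<BLAHBLAH>483920349<FOOFOO>", "<BLAHBLAH>4493<FOOFOO>"]

for source_text in lines:
    result_text = re.sub(r'<(.|\n)*?>', '', source_text)
    print(result_text)  # prints 483920349, then 4493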
