I am using Tesseract for OCR on Ubuntu 18.04.
I have a program that extracts the text from an image and prints it. I want the program to create a new text file and paste the extracted content into it, but so far I can only do the following:
1. copy the content to the clipboard
2. open a new file in a text editor (gedit)
I don't know how to paste the copied content.
Here is my program, which extracts the text from an image:
from pytesseract import image_to_string
from PIL import Image
print(image_to_string(Image.open('sample.jpg')))
Here is the program which copies the text to clipboard,
import os
def addToClipBoard(text):
    command = 'echo ' + text.strip() + '| clip'
    os.system(command)
This program opens gedit and creates a new text file:
import subprocess
proc = subprocess.Popen(['gedit', 'file.txt'])
Any help would be appreciated.
If you just want the text, then open a text file and write to it:
from pytesseract import image_to_string
from PIL import Image
text = image_to_string(Image.open('sample.jpg'))
with open('file.txt', mode = 'w') as f:
    f.write(text)
Just as I proposed in the comment, create a new file and write the extracted text into it:
with open('file.txt', 'w') as outfile:
    outfile.write(image_to_string(Image.open('sample.jpg')))
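To also cover the last step from the question (opening the result in gedit), the write and the editor launch can be combined. This is only a minimal sketch, assuming Python 3 and that gedit is on the PATH; `save_text` and its `open_in_editor` flag are hypothetical names made up for illustration, not part of pytesseract:

```python
import subprocess
from pathlib import Path


def save_text(text, path="file.txt", open_in_editor=False):
    """Write text to path and optionally open the file in gedit."""
    target = Path(path)
    target.write_text(text, encoding="utf-8")
    if open_in_editor:
        # Assumption: gedit is installed. Popen returns immediately,
        # without waiting for the editor window to close.
        subprocess.Popen(["gedit", str(target)])
    return target
```

With the OCR code above, this would be called as `save_text(image_to_string(Image.open('sample.jpg')), open_in_editor=True)`, so no clipboard is involved at all.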
I want to extract the text from multiple text files. All of the text files are in one folder.
I have tried this and successfully get the text, but when I use the string buffer somewhere else, only the text of the first file is visible to me.
I want to store the text of all the files in a single string buffer.
What I have done:
import glob
import io
Raw_txt = " "
files = [file for file in glob.glob(r'C:\\Users\\Hp\\Desktop\\RAW\\*.txt')]
for file_name in files:
    with io.open(file_name, 'r') as image_file:
        content1 = image_file.read()
        Raw_txt = content1
        print(Raw_txt)
The Raw_txt buffer only works inside this loop, but I want to use it somewhere else.
Thanks!
I think the issue is related to where you load the content of your text files.
Raw_txt is overwritten with each file.
I would recommend doing something like this, where the text is appended instead:
import glob
Raw_txt = ""
files = [file for file in glob.glob(r'C:\\Users\\Hp\\Desktop\\RAW\\*.txt')]
for file_name in files:
    with open(file_name, "r") as file:
        Raw_txt += file.read() + "\n" # I added a new line in case you want to separate the text content of each file
print(Raw_txt)
Also, you don't need the io module to read a text file.
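The same accumulation can be wrapped in a function so the combined text is available anywhere, not just inside the loop. A minimal sketch, assuming Python 3; `read_all_text` is a hypothetical helper name, and the folder is passed in rather than hard-coded:

```python
import glob
import os


def read_all_text(folder, pattern="*.txt"):
    """Return the contents of every matching file in folder, joined by newlines."""
    parts = []
    # sorted() makes the order deterministic; glob alone guarantees no ordering.
    for file_name in sorted(glob.glob(os.path.join(folder, pattern))):
        with open(file_name, "r") as handle:
            parts.append(handle.read())
    return "\n".join(parts)
```

Collecting the pieces in a list and joining once avoids rebuilding the string on every iteration, which matters only when there are many files. Usage would look like `Raw_txt = read_all_text(r'C:\Users\Hp\Desktop\RAW')`.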
My company is undergoing a re-brand and, as a result, the logo has changed. We have hundreds, if not thousands, of Word and Excel documents that contain the old logo. Can I use Python to find and replace the image?
I found some code (below) that does this for text (replacing Adber with Abder). Can it be modified to change images?
import os, re
directory = os.listdir('/')
os.chdir('/Users/')
for file in directory:
    open_file = open(file,'r')
    read_file = open_file.read()
    regex = re.compile('Adber')
    read_file = regex.sub('Abder', read_file)
    write_file = open(file,'w')
    write_file.write(read_file)
Below is code I have written that does OCR with pytesseract.
import pyperclip, os, glob, pytesseract
from PIL import Image
all_files = glob.glob('/Users/<user>/Desktop/*')
filename = max(all_files, key=os.path.getctime)
text = pytesseract.image_to_string(Image.open(filename))
pyperclip.copy(text)
Simple enough: it performs basic OCR on the specified image, which is the latest screenshot I took. What I was wondering is how to put the OCR'ed text on my clipboard. I have looked into the pyperclip library, and a simple pyperclip.copy should do it. I have tried simply copying it, and everything I have read says that is correct. Is there something I am missing?
That should work, but if it doesn't you can try pushing it in and out of a file.
import pyperclip, os, glob, pytesseract, shutil
from PIL import Image
all_files = glob.glob('/Users/<user>/Desktop/*')
filename = max(all_files, key=os.path.getctime)
text = pytesseract.image_to_string(Image.open(filename))
# write the text to a file ("w" creates the file or truncates an existing one)
with open("/Users/<user>/pyOCR/string.txt", "w") as f:
    f.write(text)
# read the text back from the file
with open("/Users/<user>/pyOCR/string.txt") as f:
    full_text = f.read()
# copy the text to the clipboard
pyperclip.copy(full_text)
I'm trying very hard to find a way to convert a PDF file to a .docx file with Python.
I have seen other posts related to this, but none of them seem to work correctly in my case.
Specifically, I'm using:
import os
import subprocess
for top, dirs, files in os.walk('/my/pdf/folder'):
    for filename in files:
        if filename.endswith('.pdf'):
            abspath = os.path.join(top, filename)
            subprocess.call('lowriter --invisible --convert-to doc "{}"'
                            .format(abspath), shell=True)
This gives me Output[1], but I can't find any .docx document in my folder afterwards.
I have LibreOffice 5.3 installed.
Any clues about it?
Thank you in advance!
I am not aware of a way to convert a PDF file into a Word file using LibreOffice.
However, you can convert the PDF to HTML and then convert the HTML to docx.
First, get the commands running on the command line. (The following is on Linux, so on your OS you may have to supply the path to the soffice binary and use a full path for the input file.)
soffice --convert-to html ./my_pdf_file.pdf
then
soffice --convert-to docx:'MS Word 2007 XML' ./my_pdf_file.html
You should end up with:
my_pdf_file.pdf
my_pdf_file.html
my_pdf_file.docx
Now wrap the commands in your subprocess code.
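One way the wrapping could look is sketched below. It assumes Python 3, that `soffice` is on the PATH, and that the intermediate HTML file is written next to the PDF, as in the listing above. `build_commands` and `pdf_to_docx` are hypothetical helper names, and the `--headless` and `--outdir` options are additions that the two commands above do not use:

```python
import os
import subprocess


def build_commands(pdf_path, soffice="soffice"):
    """Return the two soffice invocations: PDF -> HTML, then HTML -> DOCX."""
    pdf_path = os.path.abspath(pdf_path)
    out_dir = os.path.dirname(pdf_path)
    html_path = os.path.splitext(pdf_path)[0] + ".html"
    return [
        [soffice, "--headless", "--convert-to", "html",
         "--outdir", out_dir, pdf_path],
        # With an argument list (no shell), the filter name needs no quoting.
        [soffice, "--headless", "--convert-to", "docx:MS Word 2007 XML",
         "--outdir", out_dir, html_path],
    ]


def pdf_to_docx(pdf_path):
    """Run both conversion steps, raising CalledProcessError if one fails."""
    for command in build_commands(pdf_path):
        subprocess.run(command, check=True)
```

Passing the arguments as a list instead of a single string with `shell=True` avoids quoting problems with file names that contain spaces.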
I use this for multiple files:
from pdf2docx import Converter
import os
# input and output directories
path_input = '/pdftodocx/input/'
path_output = '/pdftodocx/output/'
for file in os.listdir(path_input):
    if not file.lower().endswith('.pdf'):
        continue  # skip anything that is not a PDF
    name = os.path.splitext(file)[0]  # drop '.pdf' so the result is name.docx
    cv = Converter(os.path.join(path_input, file))
    cv.convert(os.path.join(path_output, name + '.docx'), start=0, end=None)
    cv.close()
    print(file)
The code below worked for me.
import win32com.client
word = win32com.client.Dispatch("Word.Application")
word.visible = 1
pdfdoc = 'NewDoc.pdf'
todocx = 'NewDoc.docx'
wb1 = word.Documents.Open(pdfdoc)
wb1.SaveAs(todocx, FileFormat=16) # file format for docx
wb1.Close()
word.Quit()
My approach does not rely on calling an external program. However, it does the job of reading through all the pages of a PDF document and moving their text to a docx file. Note: it only works with text; images and other objects are usually ignored.
# Description: this Python script fetches the text from a PDF file and writes it to a .docx file
# import libraries
import PyPDF2
import os
import docx
mydoc = docx.Document() # document type
pdfFileObj = open('pdf/filename.pdf', 'rb') # pdf file location
pdfReader = PyPDF2.PdfFileReader(pdfFileObj) # define pdf reader object
# Loop through all the pages (page numbers start at 0)
for pageNum in range(pdfReader.numPages):
    pageObj = pdfReader.getPage(pageNum)
    pdfContent = pageObj.extractText() # extracts the content from the page
    print(pdfContent) # optional: print the content to check the output in the terminal
    mydoc.add_paragraph(pdfContent) # this adds the content to the Word document
mydoc.save("pdf/filename.docx") # give a name to your output file
I have successfully done this with pdf2docx:
from pdf2docx import parse
pdf_file = "test.pdf"
word_file = "test.docx"
parse(pdf_file, word_file, start=0, end=None)
I have the following Python script that reads image URLs from a text file, downloads the images, and saves them in the same folder.
# this script is used to download the images using the provided url
import requests
import ntpath
# save image data
def save_image_data(image_data,file_name):
    with open(file_name,'wb') as file_object:
        file_object.write(image_data)
# read the images_url file
with open('images_urls_small.txt') as file_object:
    for line in file_object:
        file_name = ntpath.basename(line)
        print(file_name)
        # download the image
        try:
            image_data = requests.get(line).content
        except:
            print("error download an image")
        # save the image
        save_image_data(image_data,file_name)
The images are downloaded fine, but for some reason each file name ends with a ?, as shown in the screenshot below.
What am I missing?
You are taking the filenames from a file:
for line in file_object:
    file_name = ntpath.basename(line)
but those lines will still have the line separator (a newline character, so \n) included. Strip your lines:
for line in file_object:
    file_name = ntpath.basename(line.strip())
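The stripping step can be isolated in a small helper. The following is only a sketch, assuming Python 3; `file_name_from_url` is a hypothetical name, and it swaps in `urllib.parse.urlsplit` with `posixpath.basename` in place of `ntpath.basename`, so that a query string after the path is also left out of the file name:

```python
import posixpath
from urllib.parse import urlsplit


def file_name_from_url(line):
    """Derive a local file name from one line of a URL list."""
    # strip() drops the trailing "\n" (and "\r" if the file has Windows line endings).
    url = line.strip()
    # Only the path component is used, so "?..." and "#..." suffixes are ignored.
    return posixpath.basename(urlsplit(url).path)
```

The stripped URL, not the raw line, should also be the value passed to `requests.get`.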